<!--Copyright © ZOMI 适用于[License](https://github.com/Infrasys-AI/AIInfra)版权许可-->

# CODE 01: Fine-Tuning the Qwen3-4B Model (DONE)

> Author by: 康煜

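Fine-tuning is the key step in adapting a pretrained large language model (LLM) to a specific task. Because data characteristics and resource budgets vary, choosing an appropriate fine-tuning method matters.

This notebook uses **Qwen3-4B** as the base model and compares four common techniques: full-parameter fine-tuning, LoRA (Low-Rank Adaptation), Prompt Tuning, and instruction tuning. It analyzes how they differ in **quality, efficiency, and data requirements**, and explores how the **dataset type** (general, domain-specific, few-shot) interacts with the choice of technique.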


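## 1. Experimental Setup

> cuda=12.4
> <br>
> python=3.12

First, install the required libraries:

 1. Install the basic libraries such as transformers. Choose the torch build that matches your CUDA version.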

In [1]:
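#!pip install transformers datasets
#!pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0  # pick the build that matches your CUDA version from the PyTorch website
#!pip install --upgrade ipywidgets  # only needed in Jupyter

 2. To install [unsloth](https://github.com/unslothai/unsloth), run the cell below; it prints the install command that matches your torch and CUDA versions: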

In [2]:
try: import torch
except ImportError: raise ImportError('Install torch via `pip install torch`')
from packaging.version import Version as V
import re
v = V(re.match(r"[0-9\.]{3,}", torch.__version__).group(0))
cuda = str(torch.version.cuda)
is_ampere = torch.cuda.get_device_capability()[0] >= 8
USE_ABI = torch._C._GLIBCXX_USE_CXX11_ABI
if cuda not in ("11.8", "12.1", "12.4", "12.6", "12.8"): raise RuntimeError(f"CUDA = {cuda} not supported!")
if   v <= V('2.1.0'): raise RuntimeError(f"Torch = {v} too old!")
elif v <= V('2.1.1'): x = 'cu{}{}-torch211'
elif v <= V('2.1.2'): x = 'cu{}{}-torch212'
elif v  < V('2.3.0'): x = 'cu{}{}-torch220'
elif v  < V('2.4.0'): x = 'cu{}{}-torch230'
elif v  < V('2.5.0'): x = 'cu{}{}-torch240'
elif v  < V('2.5.1'): x = 'cu{}{}-torch250'
elif v <= V('2.5.1'): x = 'cu{}{}-torch251'
elif v  < V('2.7.0'): x = 'cu{}{}-torch260'
elif v  < V('2.7.9'): x = 'cu{}{}-torch270'
elif v  < V('2.8.0'): x = 'cu{}{}-torch271'
elif v  < V('2.8.9'): x = 'cu{}{}-torch280'
else: raise RuntimeError(f"Torch = {v} too new!")
if v > V('2.6.9') and cuda not in ("11.8", "12.6", "12.8"): raise RuntimeError(f"CUDA = {cuda} not supported!")
x = x.format(cuda.replace(".", ""), "-ampere" if is_ampere else "")
print(f'pip install --upgrade pip && pip install "unsloth[{x}] @ git+https://github.com/unslothai/unsloth.git"')

pip install --upgrade pip && pip install "unsloth[cu124-ampere-torch260] @ git+https://github.com/unslothai/unsloth.git"


In [3]:
#!pip install --upgrade pip && pip install "unsloth[cu124-ampere-torch260] @ git+https://github.com/unslothai/unsloth.git"
#!pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo

**Install unsloth first, then xformers.**

3. To install [xformers](https://facebookresearch.github.io/xformers/), match the xformers version against your PyTorch and CUDA versions using the table below:




| **xformers**                                   | **pytorch**         | **CUDA**              |
|:-----------------------------------------------|:-------------------:|:----------------------|
| 0.0.32.post2                                   | torch==2.8.0        | cu118, cu126, cu128   |
| 0.0.31.post1                                   | torch==2.7.0        | cu118, cu126, cu128   |
| 0.0.30                                         | torch==2.7.0        | cu118, cu126, cu128   |
| 0.0.29.post3                                   | torch==2.6.0        | cu118, cu124, cu126   |
| 0.0.29.post1, 0.0.29, 0.0.28.post3             | torch==2.5.1        | cu118, cu121, cu124   |
| 0.0.28.post2                                   | torch==2.5.0        | cu118, cu121, cu124   |
| 0.0.28.post1                                   | torch==2.4.1        | cu118, cu121, cu124   |

Install as follows:



In [4]:
#!pip install --no-deps "xformers<0.0.30" trl peft accelerate bitsandbytes
#!pip install --no-deps trl peft accelerate bitsandbytes

# With a newer CUDA version you can instead build xformers from source as below; the other libraries install directly with pip
#!pip install --no-build-isolation --pre -v -U git+https://github.com/facebookresearch/xformers.git@fde5a2fb46e3f83d73e2974a4d12caf526a4203e

In [2]:
import os
# Point the Hugging Face Hub client at a mirror endpoint (set this before importing transformers or datasets)
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

In [None]:
from unsloth import FastLanguageModel
from transformers import AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, PromptTuningConfig, TaskType
from datasets import load_dataset
import torch
import warnings
warnings.filterwarnings("ignore")

# Select the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"使用设备: {device}")

# Load the Qwen3-4B model and tokenizer
model_name = "Qwen/Qwen3-4B-Instruct-2507"  # instruction-tuned Qwen3-4B checkpoint
max_seq_length = 1024  # maximum sequence length
load_in_4bit = False   # True enables 4-bit quantization to cut GPU memory; keep it False for full-parameter fine-tuning

# Load the model through Unsloth's optimized loader
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=load_in_4bit,
    trust_remote_code=True  # lets the loader run custom model code from the Hub
)

# Fall back to EOS as the pad token so that batches can be padded
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("模型和分词器加载完成")
print(f"模型参数量: {model.num_parameters()}")

使用设备: cuda
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.9.4: Fast Qwen3 patching. Transformers: 4.56.1.
   \\   /|    NVIDIA L20. Num GPUs = 1. Max memory: 47.503 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

模型和分词器加载完成
模型参数量: 4022468096


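## 2. Building the Datasets

Different fine-tuning techniques expect different dataset formats. Below is an example of the format used for **instruction tuning**:

In [8]:
# Example of the instruction-tuning dataset format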
instruction_dataset = [
    {
        "instruction": "判断情感倾向",
        "input": "这部电影的视觉效果很棒，但剧情有些乏味",
        "output": "混合情感：正面评价视觉效果，负面评价剧情",
        "system": "你是一个专业的情感分析助手",
        "history": []
    },
    {
        "instruction": "生成产品描述",
        "input": "智能手机，品牌：Apple，型号：iPhone 15，特点：A17 芯片、4800 万像素相机",
        "output": "Apple iPhone 15 搭载强大的 A17 芯片和 4800 万像素高清相机，提供卓越性能和拍摄体验。",
        "system": "你是一个产品描述生成器",
        "history": []
    },
    {
        "instruction": "翻译成英文",
        "input": "今天天气很好，我们一起去公园吧",
        "output": "The weather is nice today, let's go to the park together.",
        "system": "你是一个翻译助手",
        "history": []
    }
]

# Save the example dataset as a JSON file
import json
with open("instruction_dataset.json", "w", encoding="utf-8") as f:
    json.dump(instruction_dataset, f, ensure_ascii=False, indent=2)

print("指令微调数据集示例已保存")

指令微调数据集示例已保存


For **general text generation** tasks, the dataset format can be simpler:

In [None]:
# Example of the general-text dataset format
general_dataset = [
    {
        "text": "大型语言模型是人工智能领域的重要突破，它们通过在大量文本数据上进行预训练，学习语言的统计规律和语义表示。"
    },
    {
        "text": "迁移学习使模型能够将在一个任务上学到的知识应用到其他相关任务上，大大减少了数据需求和训练时间。"
    }
]

with open("general_dataset.json", "w", encoding="utf-8") as f:
    json.dump(general_dataset, f, ensure_ascii=False, indent=2)

print("通用文本数据集示例已保存")

通用文本数据集示例已保存


## 3. Data Preprocessing

The data must be processed differently depending on the fine-tuning method:

In [None]:

from transformers import AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset
import torch
max_seq_length = 1024
def preprocess_instruction_data(examples):
    """Tokenize instruction data; mask the prompt in `labels` so the loss covers only the assistant part."""
    input_ids_list = []
    attention_mask_list = []
    labels_list = []
    for i in range(len(examples["instruction_zh"])):
        instruction = str(examples["instruction_zh"][i])
        input_text = (
            str(examples["input_zh"][i])
            if "input_zh" in examples and examples["input_zh"][i]
            else ""
        )
        output_text = str(examples["output_zh"][i])
        system_text = (
            str(examples["system"][i])
            if "system" in examples and examples["system"][i]
            else ""
        )

        # Build the full conversation in the ChatML layout
        if system_text:
            text = f"<|im_start|>system\n{system_text}<|im_end|>\n"
        else:
            text = ""
        if input_text:
            user_content = f"{instruction}\n{input_text}"
        else:
            user_content = instruction
        text += f"<|im_start|>user\n{user_content}<|im_end|>\n"
        text += f"<|im_start|>assistant\n{output_text}<|im_end|>"

        # Locate the boundary where the assistant turn starts
        assistant_start = text.rfind("<|im_start|>assistant")
        prompt = text[:assistant_start]
        response = text[assistant_start:]

        # Tokenize prompt and response separately, without adding special tokens
        prompt_enc = tokenizer(prompt, add_special_tokens=False)
        response_enc = tokenizer(response, add_special_tokens=False)

        input_ids = prompt_enc['input_ids'] + response_enc['input_ids']
        attention_mask = prompt_enc['attention_mask'] + response_enc['attention_mask']
        labels = [-100] * len(prompt_enc['input_ids']) + response_enc['input_ids']

        # Truncate to max_seq_length
        input_ids = input_ids[:max_seq_length]
        attention_mask = attention_mask[:max_seq_length]
        labels = labels[:max_seq_length]

        # padding
        pad_len = max_seq_length - len(input_ids)
        input_ids += [tokenizer.pad_token_id] * pad_len
        attention_mask += [0] * pad_len
        labels += [-100] * pad_len

        input_ids_list.append(input_ids)
        attention_mask_list.append(attention_mask)
        labels_list.append(labels)

    return {
        "input_ids": input_ids_list,
        "attention_mask": attention_mask_list,
        "labels": labels_list,
    }


def preprocess_general_data(examples):
    """Tokenize general text data."""
    return tokenizer(
        examples["text"],
        truncation=True,
        padding=True,
        max_length=max_seq_length,
        return_tensors=None
    )
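
The prompt-masking logic in `preprocess_instruction_data` can be checked in isolation. The following is a minimal sketch, not the notebook's pipeline: `toy_tokenize` is a hypothetical character-level stand-in for the real tokenizer, and `build_example`, `max_len`, and `pad_id` are illustrative names. It shows that only the assistant turn ends up with labels other than `-100`.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def toy_tokenize(text):
    """Hypothetical stand-in for the real tokenizer: one id per character."""
    return [ord(ch) for ch in text]


def build_example(instruction, output, system="", max_len=64, pad_id=0):
    # Same ChatML layout as the preprocessing function above
    prompt = ""
    if system:
        prompt += f"<|im_start|>system\n{system}<|im_end|>\n"
    prompt += f"<|im_start|>user\n{instruction}<|im_end|>\n"
    response = f"<|im_start|>assistant\n{output}<|im_end|>"

    prompt_ids = toy_tokenize(prompt)
    response_ids = toy_tokenize(response)

    # Prompt positions are masked; response positions keep their token ids
    input_ids = (prompt_ids + response_ids)[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_len]
    attention_mask = [1] * len(input_ids)

    # Right-pad to a fixed length; padding is masked in both mask and labels
    pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "attention_mask": attention_mask + [0] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,
    }


ex = build_example("Say hi", "hi", max_len=80)
supervised = [i for i, label in enumerate(ex["labels"]) if label != IGNORE_INDEX]
print("sequence length:", len(ex["input_ids"]))
print("supervised positions:", len(supervised))
```

With a real tokenizer the counts differ, but the invariant is the same: every position before the assistant turn, and every padding position, carries the ignore label.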



In [4]:
# Load the datasets; they are large, so only the first 10,000 examples of each are used
# --- instruction dataset ---
instruction_dataset = load_dataset("silk-road/alpaca-data-gpt4-chinese", split = "train[:10000]")

In [5]:
# --- general text dataset ---
general_dataset = load_dataset("Blaze7451/Wiki-zh-20250601", split = "train[:10000]")

In [6]:
instruction_dataset[0]

{'instruction_zh': '给出三个保持健康的小贴士。',
 'input_zh': '',
 'output_zh': '1. 饮食要均衡且富有营养：确保你的餐食包含各种水果、蔬菜、瘦肉、全谷物和健康脂肪。这有助于为身体提供必要的营养，使其发挥最佳功能，并有助于预防慢性疾病。2. 经常参加体育锻炼：锻炼对于保持强壮的骨骼、肌肉和心血管健康至关重要。每周至少要进行 150 分钟的中等有氧运动或 75 分钟的剧烈运动。3. 获得足够的睡眠：获得足够的高质量睡眠对身体和心理健康至关重要。它有助于调节情绪，提高认知功能，并支持健康的生长和免疫功能。每晚睡眠目标为 7-9 小时。',
 'instruction': 'Give three tips for staying healthy.',
 'input': '',
 'output': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It h

In [7]:
general_dataset[0]

{'title': '数学',
 'text': '欧几里得，西元前三世纪的古希腊数学家，而现在被认为是几何之父，此画为拉斐尔的作品《雅典学院》\n 数学是研究数量、结构以及空间等概念及其变化的一门学科，属于形式科学的一种。数学利用抽象化和逻辑推理，从计数、计算、量度、对物体形状及运动的观察发展而成。数学家们拓展这些概念，以公式化新的猜想，以及从选定的公理及定义出发，严谨地推导出一些定理。\n 纯粹数学的知识与运用是生活中不可或缺的一环。对数学基本概念的完善，早在古埃及、美索不达米亚及古印度历史上的古代数学文本便可观见，而在古希腊那里有更为严谨的处理。从那时开始，数学的发展便持续不断地小幅进展，至 16 世纪的文艺复兴时期，因为新的科学发现和数学革新两者的交互，致使数学的加速发展，直至今日。数学并成为许多国家及地区的教育中的一部分。\n 数学在许多领域都有应用，包括科学、工程、医学、经济学和金融学等。数学对这些领域的应用通常被称为应用数学，有时亦会激起新的数学发现，并导致全新学科的发展，例如物理学的实质性发展中建立的某些理论激发数学家对于某些问题的不同角度的思考。数学家也研究纯粹数学，就是数学本身的实质性内容，而不以任何实际应用为目标。许多研究虽然以纯粹数学开始，但其过程中也发现许多可用之处。\n\n**词源**\n 西方语言中“数学”（μαθηματικά）一词源自于古希腊语的μάθημα（máthēma），其有“学习”、“学问”、“科学”，还有个较狭义且技术性的意思－「数学研究」，即使在其语源内。其形容词μαθηματικός（mathēmatikós），意思为「和学习有关的」或「用功的」，亦会被用来指「数学的」。其在英语中表面上的复数形式，及在法语中的表面复数形式 les mathématiques，可溯至拉丁文的中性复数 mathematica，由西塞罗译自希腊文复数τα μαθηματικά（ta mathēmatiká），此一希腊语被亚里士多德拿来指「万物皆数」的概念。\n 汉字表示的「数学」一词大约产生于中国宋元时期。多指象数之学，但有时也含有今天上的数学意义，例如，秦九韶的《数学九章》（《永乐大典》记，即《数书九章》也被宋代周密所着的《癸辛杂识》记为《数学大略》）、《数学通轨》（明代柯尚迁着）、《数学钥》（清代杜知耕着）、《数学拾遗》（清代丁取忠撰）

In [10]:
# Apply the preprocessing
tokenized_dataset = general_dataset.map(
    preprocess_general_data,
    batched=True,
    remove_columns=general_dataset.column_names
)

# Split into training and validation sets
split_dataset = tokenized_dataset.train_test_split(test_size=0.2)
train_dataset = split_dataset["train"]
eval_dataset = split_dataset["test"]

print(f"训练集大小: {len(train_dataset)}")
print(f"验证集大小: {len(eval_dataset)}")

训练集大小: 8000
验证集大小: 2000


In [11]:
# Apply the preprocessing
instruction_tokenized_dataset = instruction_dataset.map(
    preprocess_instruction_data,
    batched=True,
    remove_columns=instruction_dataset.column_names
)

# Split into training and validation sets
instruction_split_dataset = instruction_tokenized_dataset.train_test_split(test_size=0.2)
instruction_train_dataset = instruction_split_dataset["train"]
instruction_eval_dataset = instruction_split_dataset["test"]

# Report the sizes of the instruction splits (not the general-text splits)
print(f"训练集大小: {len(instruction_train_dataset)}")
print(f"验证集大小: {len(instruction_eval_dataset)}")

训练集大小: 8000
验证集大小: 2000


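## 4. Full-Parameter Fine-Tuning

Full-parameter fine-tuning uses **backpropagation to update all trainable parameters of the model**. Formally:

$$
\theta_{\min} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L\big(f_{\theta}(x_i),\, y_i\big)
$$

where $f_{\theta}$ is the model parameterized by $\theta$, $L$ is the loss function, and $N$ is the number of samples.

The main advantage of this approach is that every parameter can adapt to the task. The drawback is its **high computational cost**: for large models it needs a lot of GPU memory and compute.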
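
To put a rough number on that cost, the sketch below does back-of-the-envelope memory accounting for training with AdamW. The bytes-per-parameter figures are common rules of thumb and are assumptions, not measurements from this notebook. The sketch also assumes every parameter is trained and ignores activations, so real usage can differ considerably.

```python
# Rough per-parameter memory accounting for AdamW training.
# The byte counts are rule-of-thumb assumptions, not measured values.
NUM_PARAMS = 4_022_468_096  # parameter count printed when the model was loaded

BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 AdamW first moment": 4,
    "fp32 AdamW second moment": 4,
}


def gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 1024**3


total_bytes = 0
for name, nbytes in BYTES_PER_PARAM.items():
    size = NUM_PARAMS * nbytes
    total_bytes += size
    print(f"{name:<26s} {gib(size):6.2f} GiB")
print(f"{'total (no activations)':<26s} {gib(total_bytes):6.2f} GiB")
```

Under these assumptions, the weights alone take about 7.5 GiB in bf16, and the optimizer state dominates the total.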

In [None]:
# Set up the training arguments
from transformers import TrainingArguments, Trainer
import pandas as pd
import time
import math
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()


training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    per_device_train_batch_size=2,  # small batch size to fit in GPU memory
    per_device_eval_batch_size=2,
    eval_strategy="epoch",
    logging_dir="./results/logs",
    logging_steps=10,
    learning_rate=5e-6,  # full-parameter fine-tuning uses a small learning rate
    weight_decay=0.01,
    save_steps=1000,
    report_to="none",
    save_total_limit=1,  # keep only one checkpoint to save disk space
    bf16=True  # mixed precision to save GPU memory; Unsloth defaults to bf16 here, and choosing fp16 instead raises an error
)

# Create the Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer
)

# Start training
train_start = time.time()
trainer.train()
train_end = time.time()

output_dir = "./finetune_results/full_fintune"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

print("全参数微调完成")
train_duration = train_end - train_start

# Extract the evaluation records (math is already imported at the top of this cell)
train_logs = trainer.state.log_history
eval_results = {}

for log in train_logs:
    if 'eval_loss' in log:
        eval_results.update({k: v for k, v in log.items() if k.startswith('eval_')})
        
# Compute perplexity as exp(eval_loss)
if "eval_loss" in eval_results and eval_results["eval_loss"] is not None:
    perplexity = math.exp(eval_results["eval_loss"])
else:
    perplexity = None


total_params = sum(p.numel() for p in model.parameters()) / 1_000_000
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) / 1_000_000

trainable_ratio = (trainable_params / total_params) * 100
peak_memory_mb = torch.cuda.max_memory_allocated() / 1024**2 if torch.cuda.is_available() else None


record = {
    "fine_tune_method": "full_finetune",
    "train_time_sec": round(train_duration, 2),
    "eval_time_sec": round(eval_results["eval_runtime"], 2),
    "total_params": total_params,
    "trainable_params": trainable_params,
    "trainable_ratio": trainable_ratio,
    "peak_gpu_mem": round(peak_memory_mb, 2) if peak_memory_mb else None,
    "eval_loss": eval_results["eval_loss"],
    "eval_runtime": eval_results["eval_runtime"],
    "eval_samples_per_second": eval_results["eval_samples_per_second"],
    "perplexity": perplexity,
    "epoch": 2
}

csv_path = "./training_results.csv"
df = pd.DataFrame([record])
df.to_csv(csv_path, index=False, mode="a", header=not pd.io.common.file_exists(csv_path))
df

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 8,000 | Num Epochs = 2 | Total steps = 16,000
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 1 x 1) = 1
 "-____-"     Trainable parameters = 3,633,511,936 of 4,022,468,096 (90.33% trained)


| Epoch | Training Loss | Validation Loss |
|:------|:-------------:|:----------------|
| 1     | 1.1833        | 1.429914        |
| 2     | 1.1601        | 1.429399        |


全参数微调完成


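The main strength of full-parameter fine-tuning is that it can **use the model's full capacity**, and with enough data it usually achieves the best performance. Its computational cost, however, is very high: for a model like Qwen3-4B it needs a large amount of GPU memory and compute time.

Full-parameter fine-tuning is also prone to **catastrophic forgetting**, where the model loses general knowledge acquired during pretraining as it adapts to the new task.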

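## 5. LoRA Fine-Tuning

LoRA is a **parameter-efficient fine-tuning** (PEFT) technique. Its core idea is to limit the number of trainable parameters through a **low-rank decomposition**: the weight update matrix $\Delta W$ is factored into the product of two low-rank matrices:

$$
W + \Delta W = W + BA
$$

where $W$ is the pretrained weight matrix (taken here to be $d \times d$), $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times r}$ are the low-rank matrices, and $r$ is the rank ($r \ll d$).

The mathematical intuition comes from the **singular value decomposition** (SVD): any matrix can be written as a product of singular vectors and singular values, and keeping only the largest singular values gives a low-rank approximation that retains the most important information. As illustrated below, a large $N \times M$ matrix $X$ can be approximated by the product of two rectangular matrices: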


![](./images/Code01DataRealtion01.png)

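During training, LoRA initializes $A$ with a random Gaussian and $B$ with zeros, so $\Delta W = BA$ is zero when training starts. You then fine-tune on the new dataset, updating only $A$ and $B$, and use $W + \Delta W$ as the new weight matrix.

By restricting training to **low-rank updates**, LoRA greatly reduces compute and memory overhead compared with conventional fine-tuning. It also lets you fine-tune the same base weight matrix $W$ for several different datasets: store the base matrix once, and store each variant as its own pair $A_i$, $B_i$, as shown below: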

![](./images/Code01DataRealtion02.png)

Because LoRA lowers both overhead and hardware requirements so much, many variants have been derived from it, such as AdaLoRA and DLoRA. This experiment uses only the original LoRA.
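
The initialization and parameter-count claims above can be illustrated with a toy example. This is a minimal sketch in pure Python, assuming a square $d \times d$ weight and very small sizes. The helpers `matmul` and `add` are illustrative, and nothing here is how Unsloth or PEFT implement LoRA internally.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


random.seed(0)
d, r = 6, 2  # toy sizes; real layers have d in the thousands

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen pretrained weight
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, Gaussian init
B = [[0.0] * r for _ in range(d)]                                  # d x r, zero init

delta_W = matmul(B, A)  # d x d update, all zeros at initialization
W_eff = add(W, delta_W)
assert W_eff == W       # the adapted layer starts out identical to the base layer

# Simulate the effect of training by hand-setting one entry of B
B[0][0] = 0.5
delta_W = matmul(B, A)

full_params = d * d
lora_params = r * d + d * r
print("full matrix parameters:", full_params)
print("LoRA parameters:       ", lora_params)
```

PEFT-style implementations additionally scale $BA$ by `lora_alpha / r`. With `r = 16` and `lora_alpha = 16`, as configured in this notebook, that factor is 1.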




In [None]:
import gc
# Free cached GPU memory
torch.cuda.empty_cache()
gc.collect()

In [None]:
import time
import torch
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
from unsloth import FastLanguageModel, FastModel
from peft import LoraConfig, get_peft_model, PeftModel
from trl import SFTTrainer, SFTConfig
from transformers import TrainingArguments, Trainer
from datasets import load_dataset

model_name = "unsloth/Qwen3-4B-Instruct-2507"
max_seq_length = 1024
load_in_4bit = True 

model, tokenizer = FastModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=load_in_4bit,
    load_in_8bit = False, 
    full_finetuning = False, 
    trust_remote_code=True
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, 
    bias = "none",    
    max_seq_length = max_seq_length,
    use_rslora = False, 
    loftq_config = None,
)



training_args = TrainingArguments(
    output_dir="./results/lora_results",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    eval_strategy="epoch",
    learning_rate=2e-4,  # parameter-efficient fine-tuning tolerates a larger learning rate
    weight_decay=0.01,
    report_to="none",
    save_strategy = "steps",
    save_steps = 1000,  # save a checkpoint every 1000 steps
    remove_unused_columns=False,
)


lora_trainer = SFTTrainer(
    model = model,
    train_dataset = train_dataset,
    tokenizer = tokenizer,
    eval_dataset = eval_dataset,
    args = training_args
)

train_start = time.time()
lora_trainer.train()

train_end = time.time()
train_duration = train_end - train_start

Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.9.4: Fast Qwen3 patching. Transformers: 4.56.1.
   \\   /|    NVIDIA L20. Num GPUs = 1. Max memory: 47.503 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Qwen3 does not support SDPA - switching to fast eager.
Unsloth: Making `model.base_model.model.model` require gradients


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.


| Epoch | Training Loss | Validation Loss |
|:------|:-------------:|:----------------|
| 1     | 1.4866        | 1.476965        |
| 2     | 1.2439        | 1.486912        |


Unsloth: Will smartly offload gradients to save VRAM!


In [None]:
lora_adapter_dir = "./finetune_results/lora_adapter"
lora_trainer.save_model(lora_adapter_dir)
tokenizer.save_pretrained(lora_adapter_dir)
print(f"LoRA Adapter 已保存到 {lora_adapter_dir}")

base_model_full, _ = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=False,  
    trust_remote_code=True
)

merged_model = PeftModel.from_pretrained(base_model_full, lora_adapter_dir)
merged_model = merged_model.merge_and_unload()

merged_dir = "./finetune_results/lora_merged"
merged_model.save_pretrained(merged_dir)
tokenizer.save_pretrained(merged_dir)
print(f"合并后的完整模型已保存到 {merged_dir}")


LoRA Adapter 已保存到 /root/autodl-tmp/finetune_results/lora_adapter
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.9.4: Fast Qwen3 patching. Transformers: 4.56.1.
   \\   /|    NVIDIA L20. Num GPUs = 1. Max memory: 47.503 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

合并后的完整模型已保存到 /root/autodl-tmp/finetune_results/lora_merged


In [None]:
# Extract the evaluation records
import math
train_logs = lora_trainer.state.log_history
eval_results = {}

for log in train_logs:
    if 'eval_loss' in log:
        eval_results.update({k: v for k, v in log.items() if k.startswith('eval_')})
        
# Compute perplexity as exp(eval_loss)
if "eval_loss" in eval_results and eval_results["eval_loss"] is not None:
    perplexity = math.exp(eval_results["eval_loss"])
else:
    perplexity = None


total_params = sum(p.numel() for p in merged_model.parameters()) / 1_000_000
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) / 1_000_000

trainable_ratio = (trainable_params / total_params) * 100
peak_memory_mb = torch.cuda.max_memory_allocated() / 1024**2 if torch.cuda.is_available() else None



record = {
    "fine_tune_method": "LoRA",
    "train_time_sec": round(train_duration, 2),
    "eval_time_sec": round(eval_results["eval_runtime"], 2),
    "total_params": total_params,
    "trainable_params": trainable_params,
    "trainable_ratio": trainable_ratio,
    "peak_gpu_mem": round(peak_memory_mb, 2) if peak_memory_mb else None,
    "eval_loss": eval_results["eval_loss"],
    "eval_runtime": eval_results["eval_runtime"],
    "eval_samples_per_second": eval_results["eval_samples_per_second"],
    "perplexity": perplexity,
    "epoch": 2
}

csv_path = "./training_results.csv"
df = pd.DataFrame([record])
df.to_csv(csv_path, index=False, mode="a", header=not pd.io.common.file_exists(csv_path))
df

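|   | fine_tune_method | train_time_sec | eval_time_sec | total_params | trainable_params | trainable_ratio | peak_gpu_mem | eval_loss | eval_runtime | eval_samples_per_second | perplexity | epoch |
|:--|:-----------------|:---------------|:--------------|:-------------|:-----------------|:----------------|:-------------|:----------|:-------------|:------------------------|:-----------|:------|
| 0 | LoRA | 7188.79 | 273.1 | 4022.468096 | 33.030144 | 0.821141 | 14996.35 | 1.486912 | 273.1045 | 7.323 | 4.176187 | 2 |

Parameter counts are in millions, `trainable_ratio` is a percentage, and `peak_gpu_mem` is in MiB, following the conversions in the cell above.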
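
Since perplexity is defined in the cells above as `math.exp(eval_loss)`, the recorded values can be cross-checked directly. The sketch below recomputes it from the final-epoch validation losses logged so far. The recorded LoRA perplexity of 4.176187 equals the exponential of the full fine-tuning loss (1.429399), not of the LoRA loss (1.486912). That cell was therefore probably computed from a stale value and is worth re-running.

```python
import math

# Final-epoch validation losses copied from the training logs in this notebook
eval_loss = {
    "full_finetune": 1.429399,
    "LoRA": 1.486912,
}

# Perplexity is the exponential of the mean token-level cross-entropy (natural log)
perplexity = {name: math.exp(loss) for name, loss in eval_loss.items()}

for name, ppl in perplexity.items():
    print(f"{name:<14s} loss={eval_loss[name]:.6f}  perplexity={ppl:.4f}")
```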


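The main advantages of LoRA are:
1.  **Parameter efficiency**: only a very small number of parameters is trained (typically less than 1% of the original model)
2.  **Memory friendliness**: GPU memory requirements drop sharply, which makes fine-tuning large models on consumer GPUs feasible
3.  **Modularity**: separate adapters can be trained for different tasks and swapped as needed

In this experiment LoRA trains 0.82% of the parameters and finishes with a validation loss of 1.487, against 1.429 for full-parameter fine-tuning on the same data, with a recorded peak GPU memory of about 15 GB.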
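
The trainable-parameter count reported for LoRA can be reproduced by hand. The sketch below assumes the Qwen3-4B layer shapes listed in the constants. These shapes are not stated anywhere in this notebook and should be checked against `model.config`. Under those assumptions, rank 16 on the seven targeted projections gives the 33.030144 million trainable parameters recorded in the results table.

```python
# Assumed Qwen3-4B shapes (illustrative; verify against model.config before relying on them)
HIDDEN = 2560        # hidden_size
Q_OUT = 32 * 128     # num_attention_heads * head_dim
KV_OUT = 8 * 128     # num_key_value_heads * head_dim
INTERMEDIATE = 9728  # intermediate_size
NUM_LAYERS = 36      # num_hidden_layers
R = 16               # LoRA rank used in this notebook

# (in_features, out_features) of every projection listed in target_modules
TARGETS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """A has shape (r, in_features) and B has shape (out_features, r)."""
    return r * in_features + out_features * r


per_layer = sum(lora_params(i, o, R) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Because the count grows linearly in `r`, halving the rank halves the adapter size.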

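## 6. Prompt Tuning

Prompt Tuning is a **lightweight fine-tuning method** that inserts **trainable virtual tokens** at the input layer while keeping the pretrained model's parameters frozen. These virtual tokens act as a continuous prompt that steers the model toward the target task.

Formally, Prompt Tuning maps the original input $x$ to a prompted input $x'$ through a mapping $P: X \to X'$. In the soft-prompt form used here, the mapping prepends the learned virtual-token embeddings to the embeddings of $x$. The diagram below illustrates the idea: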


![](./images/Code01DataRealtion03.png)
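
A minimal sketch of the soft-prompt idea follows, in pure Python with lists standing in for tensors. `HIDDEN_SIZE = 2560` is an assumed embedding width for Qwen3-4B (check `model.config.hidden_size`), and `prepend_soft_prompt` is an illustrative helper, not a PEFT function. With 20 virtual tokens, the resulting parameter count matches the `trainable params: 51,200` line printed by `print_trainable_parameters()` in this section.

```python
import random

random.seed(0)
NUM_VIRTUAL_TOKENS = 20  # num_virtual_tokens in the PromptTuningConfig of this notebook
HIDDEN_SIZE = 2560       # assumed embedding width (verify with model.config.hidden_size)

# Trainable soft prompt: one embedding vector per virtual token
soft_prompt = [
    [random.gauss(0, 0.02) for _ in range(HIDDEN_SIZE)]
    for _ in range(NUM_VIRTUAL_TOKENS)
]


def prepend_soft_prompt(input_embeds):
    """Return what the frozen model sees: [soft prompt; input embeddings]."""
    return soft_prompt + input_embeds


# Stand-in for the frozen embedding lookup of a 5-token input
input_embeds = [[0.0] * HIDDEN_SIZE for _ in range(5)]
extended = prepend_soft_prompt(input_embeds)

trainable = NUM_VIRTUAL_TOKENS * HIDDEN_SIZE
print("sequence length seen by the model:", len(extended))
print("trainable parameters:", trainable)
```

Only `soft_prompt` would receive gradients. The virtual tokens also consume positions in the context window, so they reduce the room left for real input.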


In [28]:
import gc
# Free cached GPU memory
torch.cuda.empty_cache()
gc.collect()

4

In [None]:
import time
import torch
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
from unsloth import FastLanguageModel, FastModel
from peft import LoraConfig, get_peft_model, PeftModel, PromptTuningConfig, TaskType
from trl import SFTTrainer, SFTConfig
from transformers import TrainingArguments, Trainer
from datasets import load_dataset

model_name = "unsloth/Qwen3-4B-Instruct-2507"
max_seq_length = 1024
load_in_4bit = True 

model, tokenizer = FastModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=load_in_4bit,
    load_in_8bit = False, 
    full_finetuning = False, 
    trust_remote_code=True
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# Configure Prompt Tuning
prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,  # number of virtual tokens
    tokenizer_name_or_path=model_name
)

# Wrap the base model with the Prompt Tuning adapter
model = get_peft_model(model, prompt_config)
model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir="./results/prompt_tuning_results",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    eval_strategy="epoch",
    learning_rate=2e-4,
    weight_decay=0.01,
    report_to="none",
    save_total_limit=1,
    remove_unused_columns=False,
)

prompt_trainer = SFTTrainer(
    model = model,
    train_dataset = train_dataset,
    tokenizer = tokenizer,
    eval_dataset = eval_dataset,
    args = training_args
)

train_start = time.time()
prompt_trainer.train()

train_end = time.time()
train_duration = train_end - train_start


Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.9.4: Fast Qwen3 patching. Transformers: 4.56.1.
   \\   /|    NVIDIA L20. Num GPUs = 1. Max memory: 47.503 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Qwen3 does not support SDPA - switching to fast eager.
trainable params: 51,200 || all params: 4,022,519,296 || trainable%: 0.0013


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.


| Epoch | Training Loss | Validation Loss |
|:------|:-------------:|:----------------|
| 1     | 1.6958        | 1.631551        |
| 2     | 1.6899        | 1.612549        |


Unsloth: Will smartly offload gradients to save VRAM!


In [None]:
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch
from pathlib import Path
from peft import PeftModel
import math

prompt_adapter_dir = "./finetune_results/prompt_adapter"
prompt_trainer.save_model(prompt_adapter_dir)
tokenizer.save_pretrained(prompt_adapter_dir)
print(f"prompt tuning adapter 已保存到 {prompt_adapter_dir}")


base_model_full, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=False,
    trust_remote_code=True
)


peft_model = PeftModel.from_pretrained(base_model_full, prompt_adapter_dir)

# Prompt Tuning stores trainable virtual-token embeddings, not weight deltas,
# so there may be no merge step; in that case keep the PEFT-wrapped model.
if hasattr(peft_model, 'merge_and_unload'):
    merged_model = peft_model.merge_and_unload()
else:
    merged_model = peft_model

# Set the output directory before saving so nothing is written to a stale path
merged_dir = "./finetune_results/prompt_merged"
Path(merged_dir).mkdir(parents=True, exist_ok=True)
merged_model.save_pretrained(merged_dir)
tokenizer.save_pretrained(merged_dir)
print(f"prompt tuning 训练合并后的完整模型已保存到 {merged_dir}")



prompt tuning adapter 已保存到 /root/autodl-tmp/finetune_results/prompt_adapter
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.9.4: Fast Qwen3 patching. Transformers: 4.56.1.
   \\   /|    NVIDIA L20. Num GPUs = 1. Max memory: 47.503 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

prompt tuning 训练合并后的完整模型已保存到 /root/autodl-tmp/finetune_results/prompt_merged


In [None]:
# Extract the evaluation records

train_logs = prompt_trainer.state.log_history
eval_results = {}

for log in train_logs:
    if 'eval_loss' in log:
        eval_results.update({k: v for k, v in log.items() if k.startswith('eval_')})
        
# Compute perplexity as exp(eval_loss)
if "eval_loss" in eval_results and eval_results["eval_loss"] is not None:
    perplexity = math.exp(eval_results["eval_loss"])
else:
    perplexity = None


total_params = sum(p.numel() for p in merged_model.parameters()) / 1_000_000
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) / 1_000_000

trainable_ratio = (trainable_params / total_params) * 100
peak_memory_mb = torch.cuda.max_memory_allocated() / 1024**2 if torch.cuda.is_available() else None


record = {
    "fine_tune_method": "Prompt Tuning",
    "train_time_sec": round(train_duration, 2),
    "eval_time_sec": round(eval_results["eval_runtime"], 2),
    "total_params": total_params,
    "trainable_params": trainable_params,
    "trainable_ratio": trainable_ratio,
    "peak_gpu_mem": round(peak_memory_mb, 2) if peak_memory_mb else None,
    "eval_loss": eval_results["eval_loss"],
    "eval_runtime": eval_results["eval_runtime"],
    "eval_samples_per_second": eval_results["eval_samples_per_second"],
    "perplexity": perplexity,
    "epoch": None
}

csv_path = "./training_results.csv"
df = pd.DataFrame([record])
df.to_csv(csv_path, index=False, mode="a", header=not pd.io.common.file_exists(csv_path))
df

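|   | fine_tune_method | train_time_sec | eval_time_sec | total_params | trainable_params | trainable_ratio | peak_gpu_mem | eval_loss | eval_runtime | eval_samples_per_second | perplexity | epoch |
|:--|:-----------------|:---------------|:--------------|:-------------|:-----------------|:----------------|:-------------|:----------|:-------------|:------------------------|:-----------|:------|
| 0 | Prompt Tuning | 6092.99 | 261.0 | 4022.519296 | 0.0512 | 0.001273 | 33340.61 | 1.612549 | 260.9954 | 7.663 | 5.01558 |  |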


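The advantages of Prompt Tuning are:
1.  **Extreme parameter efficiency**: only the parameters of the virtual tokens are trained
2.  **No catastrophic forgetting**: the original model parameters are frozen, so pretrained knowledge is preserved
3.  **Multi-task learning**: a different prompt can be learned for each task while all tasks share the same base model

Prompt Tuning is particularly suited to **few-shot learning**, but it may underperform the other methods on complex tasks. In this run its final validation loss (1.613) is higher than that of both full-parameter fine-tuning (1.429) and LoRA (1.487) on the same data.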

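## 7. Instruction Tuning

Instruction tuning is a form of **supervised fine-tuning** (SFT): it trains on **labeled input-output pairs**, typically with a cross-entropy loss (the language-modeling objective). It describes the data and the objective, not which parameters are updated, so it can be combined with either full-parameter fine-tuning or PEFT; this experiment combines it with LoRA. Its focus is teaching the model to follow instructions and to produce the expected task format.

The core idea is to train on high-quality instruction-response pairs so that the model better understands and follows human instructions. The process is illustrated below: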

![](./images/Code01DataRealtion04.png)
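
The objective can be written out in a few lines. The sketch below is a pure-Python illustration of cross-entropy restricted to unmasked positions, using the same `-100` ignore label as the preprocessing step. `masked_cross_entropy` is an illustrative helper, and it assumes the logits and labels are already aligned position by position; library implementations handle the one-token shift internally.

```python
import math

IGNORE_INDEX = -100  # label used in this notebook's preprocessing to mask prompt and padding


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    logits: list of per-position score lists (one score per vocabulary entry)
    labels: list of target token ids, aligned with logits
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # prompt or padding position: contributes nothing to the loss
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))  # stable log-sum-exp
        total += log_z - scores[label]
        count += 1
    return total / count if count else 0.0


# Toy example: vocabulary of 3 tokens, 4 positions, the first two belong to the prompt
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, 1.0]]
labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 0]
loss = masked_cross_entropy(logits, labels)
print(f"loss over assistant tokens only: {loss:.4f}")
```

Changing the logits at the two masked positions leaves the loss unchanged, which is how the model is trained on the response without being penalized for the prompt.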

In [None]:
import time
import torch
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
from unsloth import FastLanguageModel, FastModel
from peft import LoraConfig, get_peft_model, PeftModel, TaskType
from trl import SFTTrainer, SFTConfig
from transformers import TrainingArguments, Trainer
from datasets import load_dataset


model_name = "unsloth/Qwen3-4B-Instruct-2507"
load_in_4bit = True 
max_seq_length = 1024


model, tokenizer = FastModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=load_in_4bit,
    load_in_8bit = False, 
    full_finetuning = False, 
    trust_remote_code=True
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


# Attach LoRA adapters for instruction tuning
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

training_args = TrainingArguments(
    output_dir="./results/instruction_tuning_results",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    eval_strategy="epoch",
    learning_rate=2e-4,
    weight_decay=0.01,
    report_to="none",
    save_strategy = "steps",
    save_steps = 1000,  # save a checkpoint every 1000 steps
    remove_unused_columns=False,
)

instruct_trainer = SFTTrainer(
    model = model,
    train_dataset = instruction_train_dataset,
    tokenizer = tokenizer,
    eval_dataset = instruction_eval_dataset,
    args = training_args
)

train_start = time.time()
instruct_trainer.train()

train_end = time.time()
train_duration = train_end - train_start

Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.9.4: Fast Qwen3 patching. Transformers: 4.56.1.
   \\   /|    NVIDIA L20. Num GPUs = 1. Max memory: 47.503 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Qwen3 does not support SDPA - switching to fast eager.
Unsloth: Making `model.base_model.model.model` require gradients


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.


Epoch,Training Loss,Validation Loss
1,1.4838,1.473591
2,1.2384,1.48207


Unsloth: Will smartly offload gradients to save VRAM!


In [None]:
instruct_adapter_dir = "./finetune_results/instruction_adapter"
instruct_trainer.save_model(instruct_adapter_dir)
tokenizer.save_pretrained(instruct_adapter_dir)
print(f"Instruction Tuning adapter 已保存到 {instruct_adapter_dir}")

base_model_full, _ = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=False,  
    trust_remote_code=True
)

merged_model = PeftModel.from_pretrained(base_model_full, instruct_adapter_dir)
merged_model = merged_model.merge_and_unload()

merged_dir = "./finetune_results/instruction_merged"
merged_model.save_pretrained(merged_dir)
tokenizer.save_pretrained(merged_dir)
print(f"合并后的完整模型已保存到 {merged_dir}")


Instruction Tuning adapter 已保存到 /root/autodl-tmp/finetune_results/instruction_adapter
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.9.4: Fast Qwen3 patching. Transformers: 4.56.1.
   \\   /|    NVIDIA L20. Num GPUs = 1. Max memory: 47.503 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

合并后的完整模型已保存到 /root/autodl-tmp/finetune_results/instruction_merged


In [None]:
# Collect the evaluation metrics from the training log (the last epoch's values win)
import math
train_logs = instruct_trainer.state.log_history
eval_results = {}

for log in train_logs:
    if 'eval_loss' in log:
        eval_results.update({k: v for k, v in log.items() if k.startswith('eval_')})
        
# Compute perplexity as exp(eval_loss)
if "eval_loss" in eval_results and eval_results["eval_loss"] is not None:
    perplexity = math.exp(eval_results["eval_loss"])
else:
    perplexity = None


total_params = sum(p.numel() for p in merged_model.parameters()) / 1_000_000
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) / 1_000_000

trainable_ratio = (trainable_params / total_params) * 100
peak_memory_mb = torch.cuda.max_memory_allocated() / 1024**2 if torch.cuda.is_available() else None


record = {
    "fine_tune_method": "Instruction Tuning",
    "train_time_sec": round(train_duration, 2),
    "eval_time_sec": round(eval_results["eval_runtime"], 2),
    "total_params": total_params,
    "trainable_params": trainable_params,
    "trainable_ratio": trainable_ratio,
    "peak_gpu_mem": round(peak_memory_mb, 2) if peak_memory_mb else None,
    "eval_loss": eval_results["eval_loss"],
    "eval_runtime": eval_results["eval_runtime"],
    "eval_samples_per_second": eval_results["eval_samples_per_second"],
    "perplexity": perplexity,
    "epoch": 2
}

import os

csv_path = "./training_results.csv"
df = pd.DataFrame([record])
# Append to the shared results file; write the header only when the file is first created
df.to_csv(csv_path, index=False, mode="a", header=not os.path.exists(csv_path))
df

Unnamed: 0,fine_tune_method,train_time_sec,eval_time_sec,total_params,trainable_params,trainable_ratio,peak_gpu_mem,eval_loss,eval_runtime,eval_samples_per_second,perplexity,epoch
0,Instruction Tuning,7426.25,300.06,4022.468096,33.030144,0.821141,18894.36,1.48207,300.0572,6.665,4.402047,2
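The `trainable_params` value in the record above (about 33.03M) can be reproduced from the LoRA settings used in this section (`r=16` on the seven projection layers). The sketch below assumes the Qwen3-4B layer shapes listed in its constants (hidden size 2560, 36 layers, 32 query heads and 8 key/value heads of dimension 128, MLP width 9728). These shapes are not stated in this notebook, so treat them as assumptions to check against the model config.

```python
# Assumed Qwen3-4B shapes (not stated in this notebook; verify against the model config)
HIDDEN = 2560
NUM_LAYERS = 36
Q_OUT = 32 * 128       # query heads * head dimension
KV_OUT = 8 * 128       # key/value heads * head dimension
INTERMEDIATE = 9728    # MLP width
R = 16                 # LoRA rank used in this section

# (in_features, out_features) of each targeted linear layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted layer."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(per_layer, total, total / 1_000_000)
```

Under these assumptions the total comes to 33,030,144 parameters, which matches the 33.030144 (million) logged above.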


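The advantages of instruction tuning include:

1.  **Task specificity**: the model adapts better to specific task formats and instructions
2.  **Data efficiency**: it usually needs less data than full-parameter fine-tuning
3.  **Composability**: it can be combined with parameter-efficient methods such as LoRA

However, instruction tuning **depends on high-quality labeled data**: if the instruction-response pairs are of poor quality, they can limit model performance.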

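## 7. Experimental Results and Analysis

To compare the fine-tuning methods, we look at each model's perplexity (or a task-specific metric) on the evaluation set. These metrics were already recorded during the runs above: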

In [2]:
import pandas as pd

experiment_results = pd.read_csv("./training_results.csv")
experiment_results

Unnamed: 0,fine_tune_method,train_time_sec,eval_time_sec,total_params,trainable_params,trainable_ratio,peak_gpu_mem,eval_loss,eval_runtime,eval_samples_per_second,perplexity,epoch
0,Full Finetune,9882.32,219.86,4022.468096,3633.511936,90.33041,35635.54,1.429399,219.8613,9.097,4.176187,2.0
1,LoRA,7188.79,273.1,4022.468096,33.030144,0.821141,14996.35,1.486912,273.1045,7.323,4.176187,2.0
2,Prompt Tuning,6092.99,261.0,4022.519296,0.0512,0.001273,33340.61,1.612549,260.9954,7.663,5.01558,2.0
3,Instruction Tuning,7426.25,300.06,4022.468096,33.030144,0.821141,18894.36,1.48207,300.0572,6.665,4.402047,2.0
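Because perplexity is defined as exp(eval_loss), the logged perplexity column can be sanity-checked directly against the logged loss. The sketch below copies the `eval_loss` and `perplexity` values from the table above; the `rows` dictionary and the 1e-3 tolerance are illustrative choices.

```python
import math

# (eval_loss, logged perplexity) copied from the results table above
rows = {
    "Full Finetune": (1.429399, 4.176187),
    "LoRA": (1.486912, 4.176187),
    "Prompt Tuning": (1.612549, 5.01558),
    "Instruction Tuning": (1.48207, 4.402047),
}

# Perplexity is the exponential of the mean cross-entropy loss
recomputed = {name: math.exp(loss) for name, (loss, _) in rows.items()}

# Flag rows whose logged perplexity disagrees with exp(eval_loss)
mismatched = [
    name for name, (_, logged) in rows.items()
    if abs(recomputed[name] - logged) > 1e-3
]

for name, value in recomputed.items():
    print(f"{name}: exp(eval_loss) = {value:.4f}")
print("Inconsistent rows:", mismatched)
```

The check flags only the LoRA row: its eval loss of 1.486912 corresponds to a perplexity of about 4.42, while the logged value repeats the Full Finetune figure.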


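#### **Result Analysis**

From the table we can draw the following conclusions:

- Full Finetune: about 90% of the parameters are trainable (`trainable_ratio` of 90.3%). It has the highest resource cost (peak GPU memory of 35,636 MB) and the longest training time (9,882 s), but also the best quality (lowest eval loss, 1.429, and lowest perplexity, 4.18).
- LoRA & Instruction Tuning: both train the same 33.03M parameters (about 0.82% of the model). They need far less memory (14,996 MB and 18,894 MB) and about 25% to 27% less training time than Full Finetune, while their eval losses (1.487 and 1.482) stay close to Full Finetune's 1.429. Note that the perplexity logged for LoRA (4.18) repeats the Full Finetune value; since perplexity is exp(eval_loss), LoRA's loss corresponds to about 4.42, close to Instruction Tuning's 4.40.
- Prompt Tuning: very few trainable parameters (about 0.001%) and the shortest training time (6,093 s), but the weakest quality (perplexity 5.02, clearly higher than the other three). In this run its peak memory (33,341 MB) was almost as high as Full Finetune's and its evaluation was not the fastest, so the small parameter count did not translate into the lowest resource use.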




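Combining theory with these experimental results, the fit between dataset characteristics and fine-tuning technique can be summarized as follows:

| **Method** | **Data requirement** | **Compute efficiency** | **Best suited for** | **Implementation difficulty** |
|------------|------------|------------|------------|------------|
| **Full fine-tuning** | Large amount of high-quality data | Low | Ample data that is similar to the pre-training data | Medium |
| **LoRA** | Medium-sized dataset | High | Limited compute, fast adaptation needed | Low |
| **Prompt Tuning** | Few-shot data | Very high | Scarce data, fast deployment needed | Low |
| **Instruction tuning** | High-quality instruction-response pairs | Medium | Task-specific formats and instruction following | Medium |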

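More specifically:

1.  **Small dataset, high similarity**: Prompt Tuning or LoRA is a good fit; only the last few layers need to change, or a small number of parameters is added.

2.  **Small dataset, low similarity**: LoRA or Adapter methods are a good fit; the early layers of the pre-trained model can be frozen and only the higher layers trained.

3.  **Large dataset, low similarity**: consider full-parameter fine-tuning or domain-adaptive pre-training (DAPT); because the data differs substantially, more training time may be needed.

4.  **Large dataset, high similarity**: full-parameter fine-tuning usually gives the best performance; this is the ideal case.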
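The four cases above can be written as a small lookup, shown here as a sketch. The function name `recommend_methods` and its boolean inputs are hypothetical; in practice "large" and "similar" are judgment calls rather than hard thresholds.

```python
def recommend_methods(data_is_large: bool, similarity_is_high: bool) -> list[str]:
    """Map the four data regimes listed above to the suggested fine-tuning methods."""
    if not data_is_large and similarity_is_high:
        return ["Prompt Tuning", "LoRA"]
    if not data_is_large and not similarity_is_high:
        return ["LoRA", "Adapter"]
    if data_is_large and not similarity_is_high:
        return ["Full fine-tuning", "DAPT"]
    # Large dataset that resembles the pre-training data
    return ["Full fine-tuning"]


for large in (False, True):
    for similar in (True, False):
        print(f"large={large}, similar={similar}: {recommend_methods(large, similar)}")
```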

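## 8. Summary and Reflections

In practice, choosing a fine-tuning technique means weighing data characteristics (quantity, quality, similarity to the pre-training data), compute constraints, task requirements, and the deployment environment. For most real-world scenarios, **LoRA** offers the best trade-off, while **Prompt Tuning** is most attractive when data is scarce or the number of trainable parameters must be kept to a minimum.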

## References

1. Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
2. Liu, X., et al. (2019). Multi-Task Deep Neural Networks for Natural Language Understanding. arXiv:1901.11504.
3. Shin, T., et al. (2020). AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. arXiv:2010.15980.
4. https://synthesis.ai/2024/08/13/fine-tuning-llms-rlhf-lora-and-instruction-tuning/
5. Lester, B., Al-Rfou, R., & Constant, N. (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. arXiv:2104.08691.
6. Wei, J., et al. (2021). Finetuned Language Models Are Zero-Shot Learners. arXiv:2109.01652.