# Finetuning to follow instructions

In [1]:
from importlib.metadata import version

pkgs = [
    "matplotlib",
    "tiktoken",
    "torch",
    "tqdm",
    "tensorflow"
]
for p in pkgs:
    print(f"{p} version: {version(p)}")

matplotlib version: 3.9.0
tiktoken version: 0.7.0
torch version: 2.3.1
tqdm version: 4.66.4
tensorflow version: 2.16.1


![Alt text](../../../img/LLM/ch06/work_flow_finetune.png)

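# 1. Introduction to instruction finetuning

- Previously, we saw that pretraining an LLM involves a training procedure in which the LLM learns to generate one word at a time
- As a result, a pretrained LLM is good at text completion but not at following instructions
- In this chapter, we teach the LLM to follow instructions better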

![Alt text](../../../img/LLM/ch06/instruction_finetune.png)

- The topics covered in this chapter are summarized in the figure below

![Alt text](../../../img/LLM/ch06/topic_summary.png)

# 2. Preparing a dataset for supervised instruction finetuning

In [2]:
import json
import os
import urllib.request  # a bare `import urllib` does not guarantee that urllib.request is loaded


def download_and_load_file(file_path, url):
    # Download the file only if it is not already cached locally
    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)

    # Parse the (downloaded or cached) JSON file
    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)

    return data


file_path = "instruction-data.json"
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch07/01_main-chapter-code/instruction-data.json"

data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))

Number of entries: 1100


- Each item in the `data` list that we loaded from the JSON file above is a dictionary of the following form

In [3]:
print("Example entry:\n", data[50])

Example entry:
 {'instruction': 'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}


- Note that the `input` field can be empty:

In [4]:
print("Another example entry:\n", data[999])

Another example entry:
 {'instruction': "What is an antonym of 'complicated'?", 'input': '', 'output': "An antonym of 'complicated' is 'simple'."}


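- Instruction finetuning is often called "supervised instruction finetuning" because it trains the model on a dataset in which the input-output pairs are provided explicitly
- There are different ways to format the entries as inputs to the LLM; the figure below shows two example formats that were used to train the Alpaca and Phi-3 LLMs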

![Alt text](../../../img/LLM/ch06/train_example.png)
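
To make the difference between the two templates concrete, the following sketch formats one entry in both styles. The Alpaca-style text mirrors the template used in this chapter. The Phi-3-style delimiters (`<|user|>`, `<|assistant|>`) are a simplified assumption about that format, and the helper names `format_alpaca` and `format_phi3` are hypothetical.

```python
# Illustrative sketch: the helper names and the exact Phi-3-style
# delimiters are assumptions, not taken from the chapter's code.

def format_alpaca(entry):
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    if entry["input"]:
        prompt += f"\n\n### Input:\n{entry['input']}"
    return prompt + f"\n\n### Response:\n{entry['output']}"


def format_phi3(entry):
    # Fold the optional input into the user turn
    user_text = entry["instruction"]
    if entry["input"]:
        user_text += f"\n{entry['input']}"
    return f"<|user|>\n{user_text}\n\n<|assistant|>\n{entry['output']}"


entry = {
    "instruction": "Identify the correct spelling of the following word.",
    "input": "Ocassion",
    "output": "The correct spelling is 'Occasion.'",
}

alpaca_text = format_alpaca(entry)
phi3_text = format_phi3(entry)
print(alpaca_text)
print("-" * 40)
print(phi3_text)
print(len(alpaca_text), len(phi3_text))
```

The Phi-3-style text is shorter because it has no fixed preamble, so each example needs fewer tokens.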

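- In this chapter, we use the Alpaca-style prompt format, which was the original prompt template for instruction finetuning
- Below, we format the input that we will pass to the LLM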

In [5]:
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

- A formatted response with an input field looks as follows

In [6]:
model_input = format_input(data[50])
desired_response = f"\n\n### Response:\n{data[50]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'


- Below is a formatted response without an input field

In [7]:
model_input = format_input(data[999])
desired_response = f"\n\n### Response:\n{data[999]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is an antonym of 'complicated'?

### Response:
An antonym of 'complicated' is 'simple'.


- Finally, before we prepare the PyTorch data loaders in the next section, we split the dataset into training, validation, and test sets

In [8]:
train_portion = int(len(data) * 0.85)
test_portion = int(len(data) * 0.1)
val_portion = len(data) - train_portion - test_portion

train_data = data[:train_portion]
test_data = data[train_portion:train_portion+test_portion]
val_data = data[train_portion+test_portion:]

In [9]:
print("Training set length:", len(train_data))
print("Validation set length:", len(val_data))
print("Test set length:", len(test_data))

Training set length: 935
Validation set length: 55
Test set length: 110
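
The split sizes follow directly from the 85% / 10% / remainder arithmetic. As a minimal sketch, the helper below (the name `split_dataset` is hypothetical) applies the same slicing to a stand-in list of 1,100 integers and checks that the three parts are disjoint and together cover the whole dataset.

```python
def split_dataset(data, train_frac=0.85, test_frac=0.1):
    # Same arithmetic as above: the validation set receives the remainder
    train_portion = int(len(data) * train_frac)
    test_portion = int(len(data) * test_frac)

    train = data[:train_portion]
    test = data[train_portion:train_portion + test_portion]
    val = data[train_portion + test_portion:]
    return train, val, test


# Stand-in for the 1,100 instruction entries
dummy_data = list(range(1100))
train, val, test = split_dataset(dummy_data)

print(len(train), len(val), len(test))

# No entry is lost or duplicated by the slicing
assert sorted(train + val + test) == dummy_data
```

Note that the slices are taken in file order, without shuffling.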


# 3. Organizing data into training batches

![Alt text](../../../img/LLM/ch06/Batching_the_dataset.png)

- We tackle this dataset batching in several steps, as summarized in the figure below

![Alt text](../../../img/LLM/ch06/step_of_batching_dataset.png)

- First, we implement an `InstructionDataset` class that pre-tokenizes all inputs in the dataset, similar to the `SpamDataset` in Chapter 5

![Alt text](../../../img/LLM/ch06/InstructionDataset.png)

In [10]:
import torch
from torch.utils.data import Dataset


class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        
        self.encoded_texts = []
        for entry in data:
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(
                tokenizer.encode(full_text)
            )
            
    def __getitem__(self, index):
        return self.encoded_texts[index]

    def __len__(self):
        return len(self.data)

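- As in Chapter 5, we want to collect multiple training examples into a batch to speed up training; this requires padding all inputs to a similar length
- Also as in the previous chapter, we use the `<|endoftext|>` token as the padding token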

In [14]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))

[50256]


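- In Chapter 5, we padded all examples in the dataset to the same length
    - Here, we take a more sophisticated approach and develop a custom "collate" function that we can pass to the data loader
    - This custom collate function pads the training examples in each batch to the same length (but different batches can have different lengths)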

![Alt text](../../../img/LLM/ch06/collate_function.png)
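
To see why per-batch padding helps, the sketch below counts the padding tokens needed for a toy set of sequence lengths under both strategies. The lengths and the helper name `count_padding` are made up for illustration.

```python
def count_padding(lengths, batch_size, per_batch):
    """Count the pad tokens needed to make sequences equal length."""
    global_max = max(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Pad to the longest item in this batch, or in the whole dataset
        target = max(batch) if per_batch else global_max
        total += sum(target - n for n in batch)
    return total


# Hypothetical token counts for six training examples
lengths = [5, 2, 3, 40, 38, 41]

global_pad = count_padding(lengths, batch_size=3, per_batch=False)
batch_pad = count_padding(lengths, batch_size=3, per_batch=True)
print("Pad tokens with dataset-wide padding:", global_pad)
print("Pad tokens with per-batch padding:", batch_pad)
```

In this toy case, the batch of short sequences no longer has to be padded up to the longest sequence in the dataset, which removes most of the padding.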

In [15]:
def custom_collate_draft_1(batch, pad_token_id=50256, device="cpu"):
    # Find the longest sequence in the batch (+1 for the extra token added below)
    batch_max_length = max(len(item) + 1 for item in batch)

    inputs_lst = []

    for item in batch:
        new_item = item.copy()
        # Append one <|endoftext|> token
        new_item += [pad_token_id]
        # Pad the sequence to batch_max_length
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        # Drop the extra padding token that was added above
        inputs = torch.tensor(padded[:-1])
        inputs_lst.append(inputs)

    inputs_tensor = torch.stack(inputs_lst).to(device)
    return inputs_tensor

In [16]:
inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]

batch = (
    inputs_1,
    inputs_2,
    inputs_3
)

print(custom_collate_draft_1(batch))

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])


![Alt text](../../../img/LLM/ch06/token_id_for_training.png)

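- Above, we returned only the inputs to the LLM; for LLM training, however, we also need the target values
- As in pretraining an LLM, the targets are the inputs shifted by 1 position to the right, so that the LLM learns to predict the next token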

![Alt text](../../../img/LLM/ch06/predict_token_id.png)
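
The shift can be illustrated without any tensors. In this sketch, plain Python lists stand in for token IDs (the IDs themselves are arbitrary), and each printed row shows the context the model sees and the token it should predict at that position.

```python
pad_token_id = 50256

# Toy token IDs with one end-of-text token appended
tokens = [7, 8, 9] + [pad_token_id]

inputs = tokens[:-1]   # drop the last token
targets = tokens[1:]   # same sequence shifted by one position

for i in range(len(inputs)):
    context = inputs[:i + 1]
    print(f"{context} -> {targets[i]}")
```

The last row shows that, after the final real token, the model is trained to predict the end-of-text token.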

In [17]:
def custom_collate_draft_2(batch, pad_token_id=50256, device="cpu"):
    # Find the longest sequence in the batch (+1 for the extra token added below)
    batch_max_length = max(len(item) + 1 for item in batch)

    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        # Append one <|endoftext|> token, then pad to batch_max_length
        new_item += [pad_token_id]
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])  # Drop the last token for the inputs
        targets = torch.tensor(padded[1:])  # Shift by one position for the targets
        inputs_lst.append(inputs)
        targets_lst.append(targets)

    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor

In [18]:
inputs, targets = custom_collate_draft_2(batch)
print(inputs)
print(targets)

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256, 50256, 50256, 50256],
        [    8,     9, 50256, 50256, 50256]])


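- Next, we introduce an `ignore_index` value to replace all padding token IDs with a new value; the purpose of this `ignore_index` is to let the loss function ignore the padding values (more on that later)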

![Alt text](../../../img/LLM/ch06/replace_padding_tokens.png)

- Concretely, this means that we replace the token IDs corresponding to 50256 with -100, as illustrated below

![Alt text](../../../img/LLM/ch06/replace_token_example.png)

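- (In addition, we introduce `allowed_max_length` in case we want to limit the length of the samples; this is useful if you plan to work with your own datasets that are longer than the 1024-token context size supported by the GPT-2 model.)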

In [19]:
def custom_collate_fn(
    batch,
    pad_token_id=50256,
    ignore_index=-100,
    allowed_max_length=None,
    device="cpu"
):
    # Find the longest sequence in the batch
    batch_max_length = max(len(item)+1 for item in batch)

    # Pad and prepare inputs and targets
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        # Add an <|endoftext|> token
        new_item += [pad_token_id]
        # Pad sequences to max_length
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])  # Truncate the last token for inputs
        targets = torch.tensor(padded[1:])  # Shift +1 to the right for targets

        # New: Replace all but the first padding tokens in targets by ignore_index
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index

        # New: Optionally truncate to maximum sequence length
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        inputs_lst.append(inputs)
        targets_lst.append(targets)

    # Convert list of inputs and targets to tensors and transfer to target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)

    return inputs_tensor, targets_tensor

In [20]:
inputs, targets = custom_collate_fn(batch)
print(inputs)
print(targets)

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256,  -100,  -100,  -100],
        [    8,     9, 50256,  -100,  -100]])


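- Let's see what the replacement by -100 accomplishes
- For illustration purposes, let's assume we have a small classification task with two class labels, 0 and 1, similar to Chapter 5
- If we have the following logits values (the outputs of the model's last layer), we calculate the following loss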

In [21]:
import torch.nn.functional as F
logits_1 = torch.tensor(
    [[-1.0, 1.0],
     [-0.5, 1.5]]
)
targets_1 = torch.tensor([0, 1])

loss_1 = F.cross_entropy(logits_1, targets_1)
print(loss_1)

tensor(1.1269)


- Now, as expected, adding one more training example influences the loss

In [22]:
logits_2 = torch.tensor(
    [[-1.0, 1.0],
     [-0.5, 1.5],
     [-0.5, 1.5]]  # New 3rd training example
)
targets_2 = torch.tensor([0, 1, 1])

loss_2 = torch.nn.functional.cross_entropy(logits_2, targets_2)
print(loss_2)

tensor(0.7936)


- Let's see what happens if we replace the class label of one of the examples with -100

In [23]:
targets_3 = torch.tensor([0, 1, -100])

loss_3 = torch.nn.functional.cross_entropy(logits_2, targets_3)
print(loss_3)
print("loss_1 == loss_3:", loss_1 == loss_3)

tensor(1.1269)
loss_1 == loss_3: tensor(True)
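
The three loss values above can be reproduced by hand. The following sketch is a plain-Python stand-in for a mean-reduced cross-entropy loss (the function name `cross_entropy_mean` is hypothetical): rows whose target equals `ignore_index` are skipped, and the mean is taken over the remaining rows only.

```python
import math


def cross_entropy_mean(logits, targets, ignore_index=-100):
    # Assumes at least one target is not equal to ignore_index
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # ignored rows count toward neither the sum nor the mean
        # Negative log-softmax of the target class
        log_sum_exp = math.log(sum(math.exp(x) for x in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)


logits_2 = [[-1.0, 1.0], [-0.5, 1.5], [-0.5, 1.5]]

loss_two = cross_entropy_mean(logits_2[:2], [0, 1])
loss_three = cross_entropy_mean(logits_2, [0, 1, 1])
loss_masked = cross_entropy_mean(logits_2, [0, 1, -100])

print(round(loss_two, 4), round(loss_three, 4), round(loss_masked, 4))
```

The rounded values match the PyTorch outputs above (1.1269, 0.7936, and 1.1269): the masked row changes neither the numerator nor the denominator of the mean.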


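- As we can see, the loss on these 3 training examples is the same as the loss we calculated from the 2 training examples, which means that the cross-entropy loss function ignored the training example with the -100 label
- By default, PyTorch uses the `cross_entropy(..., ignore_index=-100)` setting, which ignores examples whose label is -100
- Using this -100 `ignore_index`, we can ignore the additional end-of-text (padding) tokens that were used to pad the training examples in a batch to equal length
- However, we don't want to ignore the first instance of the end-of-text (padding) token (50256), because it helps signal to the LLM when the response is complete
- In practice, it is also common to mask out the target token IDs that correspond to the instruction, as illustrated in the figure below (this is a recommended reader exercise after completing the chapter)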

![Alt text](../../../img/LLM/ch06/token_id_placeholder.png)
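
One possible way to approach this masking is sketched below, with plain Python lists standing in for token IDs. The helper name `mask_instruction_targets` and the toy IDs are assumptions for illustration; in a real collate function, the instruction length would have to be carried along with each tokenized example.

```python
IGNORE_INDEX = -100
PAD_TOKEN_ID = 50256


def mask_instruction_targets(instruction_ids, response_ids):
    # Full sequence: prompt tokens, response tokens, one end-of-text token
    tokens = instruction_ids + response_ids + [PAD_TOKEN_ID]
    inputs = tokens[:-1]
    targets = tokens[1:]

    # targets[i] is tokens[i + 1], so the first len(instruction_ids) - 1
    # targets are still instruction tokens and are excluded from the loss
    num_masked = max(len(instruction_ids) - 1, 0)
    targets = [IGNORE_INDEX] * num_masked + targets[num_masked:]
    return inputs, targets


# Toy IDs: 4 instruction tokens followed by 3 response tokens
inputs, targets = mask_instruction_targets([11, 12, 13, 14], [21, 22, 23])
print(inputs)
print(targets)
```

With this masking, the loss is computed only on the response tokens and the first end-of-text token, while the inputs are left unchanged.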

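# 4. Creating data loaders for an instruction dataset

- In this section, we use the `InstructionDataset` class and the `custom_collate_fn` function to instantiate the training, validation, and test data loaders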

![Alt text](../../../img/LLM/ch06/create_data_loaders.png)

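- Another detail of the previous `custom_collate_fn` function is that we now move the data directly to the target device (e.g., GPU) instead of doing it in the main training loop; this improves efficiency because the transfer can be carried out as a background process when we use `custom_collate_fn` as part of the data loader
- Using the `partial` function from Python's `functools` standard library, we create a new function with the `device` argument of the original function pre-filled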
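
As a small illustration of how `partial` pre-fills arguments, the sketch below uses a stand-in function (`toy_collate` is hypothetical and only reports its arguments) rather than the real collate function.

```python
from functools import partial


def toy_collate(batch, pad_token_id=50256, allowed_max_length=None, device="cpu"):
    # Stand-in that just reports which settings it was called with
    return {
        "n_items": len(batch),
        "device": device,
        "allowed_max_length": allowed_max_length,
    }


# Pre-fill two keyword arguments; `batch` is still supplied by the caller
gpu_collate = partial(toy_collate, device="cuda", allowed_max_length=1024)

result = gpu_collate([[0, 1], [2]])
print(result)

# Pre-filled keyword arguments can still be overridden per call
print(gpu_collate([[0, 1]], device="cpu")["device"])
```

This matters because a data loader calls its `collate_fn` with the batch as the only argument, so any other settings have to be fixed in advance.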

In [24]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(f"Device:{device}")

Device:cpu


In [25]:
from functools import partial

custom_collate_fn = partial(custom_collate_fn, device=device, allowed_max_length=1024)

- Next, we instantiate the data loaders as in previous chapters, except that we now provide our own collate function for the batching process

In [26]:
from torch.utils.data import DataLoader

num_workers = 0
batch_size = 8

torch.manual_seed(123)

train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    collate_fn=custom_collate_fn,
    shuffle=True,
    drop_last=True,
    num_workers=num_workers
)

val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    collate_fn=custom_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)

test_dataset = InstructionDataset(test_data, tokenizer)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    collate_fn=custom_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)

In [27]:
print("Train loader:")
for inputs, targets in train_loader:
    print(inputs.shape, targets.shape)

Train loader:
torch.Size([8, 61]) torch.Size([8, 61])
torch.Size([8, 76]) torch.Size([8, 76])
torch.Size([8, 73]) torch.Size([8, 73])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 65]) torch.Size([8, 65])
torch.Size([8, 72]) torch.Size([8, 72])
torch.Size([8, 80]) torch.Size([8, 80])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 62]) torch.Size([8, 62])
torch.Size([8, 75]) torch.Size([8, 75])
torch.Size([8, 62]) torch.Size([8, 62])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 77]) torch.Size([8, 77])
torch.Size([8, 69]) torch.Size([8, 69])
torch.Size([8, 79]) torch.Size([8, 79])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 66]) torch.Size([8, 66])
torch.Size([8, 83]) torch.Size([8, 83])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 80]) torch.Size([8, 80])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 69]) torch.Size([8, 69])
torch.Size([8, 65]) torch.Size([8, 65])
torch.Size([8, 68]) torch.

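- As we can see from the output above, all batches have a batch size of 8 but different lengths, as expected
- Let's also double-check that the inputs contain the `<|endoftext|>` padding tokens corresponding to token ID 50256 by printing the contents of the first training example in the `inputs` batch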

In [28]:
print(inputs[0])

tensor([21106,   318,   281, 12064,   326,  8477,   257,  4876,    13, 19430,
          257,  2882,   326, 20431, 32543,   262,  2581,    13,   198,   198,
        21017, 46486,    25,   198, 30003,  6525,   262,  6827,  1262,   257,
          985,   576,    13,   198,   198, 21017, 23412,    25,   198,   464,
         5156,   318,   845, 13779,    13,   198,   198, 21017, 18261,    25,
          198,   464,  5156,   318,   355, 13779,   355,   257,  4936,    13,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256])


In [29]:
print(targets[0])

tensor([  318,   281, 12064,   326,  8477,   257,  4876,    13, 19430,   257,
         2882,   326, 20431, 32543,   262,  2581,    13,   198,   198, 21017,
        46486,    25,   198, 30003,  6525,   262,  6827,  1262,   257,   985,
          576,    13,   198,   198, 21017, 23412,    25,   198,   464,  5156,
          318,   845, 13779,    13,   198,   198, 21017, 18261,    25,   198,
          464,  5156,   318,   355, 13779,   355,   257,  4936,    13, 50256,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100])
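
As a side note on the loader settings above, `drop_last` determines how many batches each loader yields. The following arithmetic sketch (the helper name `num_batches` is hypothetical) applies the split sizes and the batch size from this section.

```python
import math


def num_batches(num_examples, batch_size, drop_last):
    # drop_last=True discards the final, incomplete batch
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


batch_size = 8
train_batches = num_batches(935, batch_size, drop_last=True)
val_batches = num_batches(55, batch_size, drop_last=False)
test_batches = num_batches(110, batch_size, drop_last=False)

print("Train batches:", train_batches)
print("Validation batches:", val_batches)
print("Test batches:", test_batches)
```

With 935 training examples and a batch size of 8, the 7 leftover examples of each shuffled epoch are dropped, which is why every training batch has exactly 8 rows.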


# 5. Loading a pretrained LLM

![Alt text](../../../img/LLM/ch06/loading_pretrained_LLM.png)

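- Instead of loading the smallest model with 124 million parameters, we load the medium version with 355 million parameters, because the 124-million-parameter model is too small to achieve qualitatively reasonable results via instruction finetuning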

In [30]:
from gpt_download import download_and_load_gpt2
from ch06 import GPTModel, load_weights_into_gpt


BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
    "gpt2-small (124M)": {"embedding_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"embedding_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"embedding_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"embedding_dim": 1600, "n_layers": 48, "n_heads": 25},
}

CHOOSE_MODEL = "gpt2-medium (355M)"

BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")

model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval()

File already exists and is up-to-date: gpt2\355M\checkpoint
File already exists and is up-to-date: gpt2\355M\encoder.json
File already exists and is up-to-date: gpt2\355M\hparams.json
File already exists and is up-to-date: gpt2\355M\model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2\355M\model.ckpt.index
File already exists and is up-to-date: gpt2\355M\model.ckpt.meta
File already exists and is up-to-date: gpt2\355M\vocab.bpe


GPTModel(
  (token_emb): Embedding(50257, 1024)
  (pos_emb): Embedding(1024, 1024)
  (drop_emb): Dropout(p=0.0, inplace=False)
  (transformer_blocks): Sequential(
    (0): TransformerBlock(
      (attention): MultiHeadAttention(
        (W_q): Linear(in_features=1024, out_features=1024, bias=True)
        (W_k): Linear(in_features=1024, out_features=1024, bias=True)
        (W_v): Linear(in_features=1024, out_features=1024, bias=True)
        (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (ffn): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=1024, out_features=4096, bias=True)
          (1): GELU()
          (2): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.0, inplace=False)
    )
    (1): TransformerBlock(
      (attention): MultiHeadAttention(
        (W_q):

- Before we start finetuning the model in the next section, let's see how it performs on one of the validation tasks

In [31]:
torch.manual_seed(123)

input_text = format_input(val_data[0])
print(input_text)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Convert the active sentence to passive: 'The chef cooks the meal every day.'


In [32]:
from ch06 import generate, text_to_token_ids, token_ids_to_text

token_ids = generate(
    model=model,
    idx=text_to_token_ids(input_text, tokenizer),
    max_new_tokens=35,
    context_size=BASE_CONFIG["context_length"],
    eos_id=50256
)
generated_text = token_ids_to_text(token_ids, tokenizer)

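- Note that the `generate` function we used in previous chapters returns the combined input and output text, which was convenient in the previous section for creating legible text
- To isolate the response, we can remove the length of the instruction from the start of `generated_text`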
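
A slightly more defensive version of this slicing is sketched below. The helper name `extract_response` is hypothetical, and the sketch assumes that decoding the token IDs reproduces the prompt text exactly. It checks that the generated text really starts with the prompt before slicing and, as an optional extra step not used in this section, also drops the `### Response:` header.

```python
def extract_response(generated_text, prompt, strip_header=True):
    if not generated_text.startswith(prompt):
        raise ValueError("generated text does not start with the prompt")

    # Remove the prompt from the front of the generated text
    response = generated_text[len(prompt):].strip()

    # Optionally remove the section header that precedes the answer
    header = "### Response:"
    if strip_header and response.startswith(header):
        response = response[len(header):].strip()
    return response


prompt = "### Instruction:\nSay hello."
generated = prompt + "\n\n### Response:\nHello!"

print(extract_response(generated, prompt))
print(extract_response(generated, prompt, strip_header=False))
```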

In [33]:
response_text = generated_text[len(input_text):].strip()
print(response_text)

### Response:

The chef cooks the meal every day.

### Instruction:

Convert the active sentence to passive: 'The chef cooks the


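- As we can see, the model is not yet able to follow instructions; it creates a "Response" section, but it simply repeats the original input sentence and the instruction

# 6. Finetuning the LLM on instruction data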

![Alt text](../../../img/LLM/ch06/Instruction_finetuning_LLM.png)

In [34]:
from ch06 import calc_loss_loader, train_model_simple


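- Before we start training, let's calculate the initial training and validation set losses (as in previous chapters, the goal is to minimize the loss)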

In [35]:
model.to(device)

torch.manual_seed(123)

with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
    val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)

Training loss: 3.825895595550537
Validation loss: 3.7619208812713625


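- Note that training is somewhat more expensive than in previous chapters because we are using a larger model (355 million instead of 124 million parameters)
- The runtimes for various devices are shown below for reference (running this notebook on a compatible GPU device requires no changes to the code)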

|Model|Device|Runtime for 2 Epochs|
|---|:--:|---:|
|gpt2-medium (355M)|CPU (M3 MacBook Air)|15.78 minutes|
|gpt2-medium (355M)|GPU (M3 MacBook Air)|10.77 minutes|
|gpt2-medium (355M)|GPU (L4)|1.83 minutes|
|gpt2-medium (355M)|GPU (A100)|0.86 minutes|
|gpt2-small (124M)|CPU (M3 MacBook Air)|5.74 minutes|
|gpt2-small (124M)|GPU (M3 MacBook Air)|3.73 minutes|
|gpt2-small (124M)|GPU (L4)|0.69 minutes|
|gpt2-small (124M)|GPU (A100)|0.39 minutes|