<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Chapter 5 Exercise Solutions

In [1]:
from importlib.metadata import version

pkgs = ["numpy",
        "tiktoken",
        "torch",
        "tensorflow"  # For OpenAI's pretrained weights
       ]
for p in pkgs:
    print(f"{p} version: {version(p)}")

numpy version: 2.0.2
tiktoken version: 0.9.0
torch version: 2.7.1
tensorflow version: 2.19.0


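# Exercise 5.1: Temperature-scaled softmax scores and sampling probabilities

- We can print the number of times the word "pizza" is sampled using the `print_sampled_tokens` function defined in this section
- Let's start with the code defined in section 5.3.1

- It is sampled 0 times if the temperature is 1 or 0.1, and 32 times if the temperature is scaled up to 5. The estimated probability is 32/1000 * 100% = 3.2%

- The actual probability is 4.3% and is contained in the rescaled softmax probability tensor (`scaled_probas[2][6]`)

- Below is a self-contained example using code from chapter 5: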

In [2]:
import torch

vocab = { 
    "closer": 0,
    "every": 1, 
    "effort": 2, 
    "forward": 3,
    "inches": 4,
    "moves": 5, 
    "pizza": 6,
    "toward": 7,
    "you": 8,
} 
inverse_vocab = {v: k for k, v in vocab.items()}

next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)

def print_sampled_tokens(probas):
    torch.manual_seed(123)
    sample = [torch.multinomial(probas, num_samples=1).item() for i in range(1_000)]
    sampled_ids = torch.bincount(torch.tensor(sample))
    for i, freq in enumerate(sampled_ids):
        print(f"{freq} x {inverse_vocab[i]}")


def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)


temperatures = [1, 0.1, 5]  # Original, lower, and higher temperature
scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]

- Now, we can iterate over `scaled_probas` and print the sampling frequencies in each case:

In [3]:
for i, probas in enumerate(scaled_probas):
    print("\n\nTemperature:", temperatures[i])
    print_sampled_tokens(probas)



Temperature: 1
73 x closer
0 x every
0 x effort
582 x forward
2 x inches
0 x moves
0 x pizza
343 x toward


Temperature: 0.1
0 x closer
0 x every
0 x effort
985 x forward
0 x inches
0 x moves
0 x pizza
15 x toward


Temperature: 5
165 x closer
75 x every
42 x effort
239 x forward
71 x inches
46 x moves
32 x pizza
227 x toward
103 x you


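- Note that the sampling frequency of the word "pizza" only approximates its actual probability
- For example, if it is sampled 32 out of 1000 times, the estimated probability is 3.2%
- To obtain the actual probability, we can check it directly by accessing the corresponding entry in `scaled_probas`

- Since "pizza" is the 7th entry in the vocabulary (index 6), we obtain it for the temperature of 5 as follows: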

In [4]:
temp5_idx = 2
pizza_idx = 6

scaled_probas[temp5_idx][pizza_idx]

tensor(0.0430)

If the temperature is set to 5, there is a 4.3% probability that the word "pizza" is sampled
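
- As a cross-check that does not depend on PyTorch, the exact probability and the expected sampling noise can be recomputed with the standard library. This is a minimal sketch: the logits are copied from the cell above, while `softmax_with_temperature_py` and the binomial standard-error estimate are illustrative additions, not part of the chapter code

```python
import math

# Logits copied from the exercise; index 6 corresponds to "pizza"
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
pizza_idx = 6


def softmax_with_temperature_py(logits, temperature):
    # Hypothetical pure-Python counterpart of softmax_with_temperature
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


probas = softmax_with_temperature_py(next_token_logits, temperature=5)
p_pizza = probas[pizza_idx]

# Standard error of a frequency estimate based on n independent draws
n_samples = 1_000
std_err = math.sqrt(p_pizza * (1 - p_pizza) / n_samples)

print(f"Exact probability: {p_pizza:.4f}")
print(f"Standard error with {n_samples} samples: {std_err:.4f}")
```

- Under these assumptions, the standard error is roughly 0.6 percentage points, so an estimate of 3.2% lies within about two standard errors of the exact 4.3%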

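# Exercise 5.2: Different temperature and top-k settings

- Both the temperature and top-k settings have to be tuned for the individual LLM (a trial-and-error process until it generates desirable outputs)
- The desirable outcomes are also application-specific, though
  - Lower top-k and temperature values result in less random outcomes, which is desirable for educational content, technical writing or question answering, data analyses, code generation, and so forth
  - Higher top-k and temperature values result in more diverse and random outputs, which is more desirable for brainstorming tasks, creative writing, and so forth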
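
- To make the effect of the two settings concrete, the sketch below applies top-k filtering followed by temperature scaling to the toy logits from exercise 5.1. It uses plain Python rather than the chapter's PyTorch code, and `top_k_temperature_probas` as well as the two example settings are hypothetical choices for illustration only

```python
import math

# Logits reused from exercise 5.1 (nine-token toy vocabulary)
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]


def top_k_temperature_probas(logits, top_k=None, temperature=1.0):
    # Hypothetical helper: keep the top_k largest logits, mask the rest
    # with -inf, then apply a temperature-scaled softmax
    if top_k is not None:
        threshold = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= threshold else float("-inf") for x in logits]
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)
    exps = [math.exp(x - max_scaled) for x in scaled]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Example settings chosen for illustration, not recommendations
conservative = top_k_temperature_probas(next_token_logits, top_k=3, temperature=0.5)
diverse = top_k_temperature_probas(next_token_logits, top_k=9, temperature=2.0)

print("Low top-k, low temperature:  ", [round(p, 3) for p in conservative])
print("High top-k, high temperature:", [round(p, 3) for p in diverse])
```

- With the conservative setting, only three tokens keep a non-zero probability and most of the mass sits on the most likely token; with the diverse setting, every token remains a candidate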

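# Exercise 5.3: Deterministic behavior in the decoding functions

There are multiple ways to force deterministic behavior with the `generate` function:

1. Setting `temperature=0.0` (no temperature scaling) together with `top_k=None`;
2. Setting `top_k=1`.

Below is a self-contained example using code from chapter 5: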

In [4]:
import tiktoken
import torch
from previous_chapters import GPTModel


GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}


torch.manual_seed(123)

tokenizer = tiktoken.get_encoding("gpt2")
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("model.pth", weights_only=True))
model.eval();

In [5]:
from gpt_generate import generate, text_to_token_ids, token_ids_to_text
from previous_chapters import generate_text_simple

In [6]:
# Deterministic function that uses torch.argmax

start_context = "Every effort moves you"

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you know," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed lun


In [8]:
# Deterministic behavior: no top_k, no temperature scaling

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=None,
    temperature=0.0
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you know," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed lun


- Note that re-executing the previous code cell will produce the exact same generated text:

In [9]:
# Deterministic behavior: no top_k, no temperature scaling

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=None,
    temperature=0.0
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you know," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed lun
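
The reason both settings are deterministic can be sketched on the toy logits from exercise 5.1. The function below is an illustrative stand-in for a single decoding step, not the book's `generate` function; it assumes that a temperature of 0.0 falls back to an argmax, which is consistent with the identical greedy outputs above. The name `sample_next_token` and the `rng` parameter are hypothetical.

```python
import math
import random

# Logits reused from exercise 5.1; index 3 ("forward") has the largest logit
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]


def sample_next_token(logits, top_k=None, temperature=0.0, rng=random):
    # Illustrative stand-in for one decoding step (not the book's `generate`)
    if top_k is not None:
        threshold = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= threshold else float("-inf") for x in logits]
    if temperature > 0.0:
        scaled = [x / temperature for x in logits]
        max_scaled = max(scaled)
        weights = [math.exp(x - max_scaled) for x in scaled]
        return rng.choices(range(len(logits)), weights=weights, k=1)[0]
    # Assumed fallback: without temperature scaling, take the argmax
    return max(range(len(logits)), key=lambda i: logits[i])


greedy = sample_next_token(next_token_logits, top_k=None, temperature=0.0)

# With top_k=1 only one token has a non-zero weight, so the seed is irrelevant
top1 = {
    sample_next_token(next_token_logits, top_k=1, temperature=1.5,
                      rng=random.Random(seed))
    for seed in range(20)
}

print("Greedy token id:", greedy)
print("Token ids sampled with top_k=1:", top1)
```

In both cases the only token that can be selected is index 3 ("forward"), regardless of the random seed.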


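# Exercise 5.4: Continued pretraining

- If we are still in the Python session in which the model was first trained in chapter 5, continuing the pretraining for one more epoch only requires loading the model and optimizer saved in the main chapter and calling the `train_model_simple` function again

- It takes a few more steps to make this reproducible in this new code environment
- First, we load the tokenizer, model, and optimizer: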

In [10]:
import tiktoken
import torch
from previous_chapters import GPTModel


GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = tiktoken.get_encoding("gpt2")

checkpoint = torch.load("model_and_optimizer.pth", weights_only=True)
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train();

- Next, we initialize the data loaders:

In [11]:
import os
import urllib.request
from previous_chapters import create_dataloader_v1


file_path = "the-verdict.txt"
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"

if not os.path.exists(file_path):
    with urllib.request.urlopen(url) as response:
        text_data = response.read().decode('utf-8')
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
else:
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()


# Train/validation ratio
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]


torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)

- Finally, we use the `train_model_simple` function to train the model:

In [12]:
from gpt_train import train_model_simple

num_epochs = 1
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)

Ep 1 (Step 000000): Train loss 0.271, Val loss 6.545
Ep 1 (Step 000005): Train loss 0.244, Val loss 6.614
Every effort moves you?"  "Yes--quite insensible to the irony. She wanted him vindicated--and by me!"  He laughed again, and threw back his head to look up at the sketch of the donkey. "There were days when I


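# Exercise 5.5: Training and validation set losses of the pretrained model

- We can use the following code to calculate the training and validation set losses of the GPT model: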

```python
train_loss = calc_loss_loader(train_loader, gpt, device)
val_loss = calc_loss_loader(val_loader, gpt, device)
```

- The resulting losses for the 124M parameter model are as follows:

```
Training loss: 3.754748503367106
Validation loss: 3.559617757797241
```

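- The main observation is that the training and validation set performances are in the same ballpark
- This can have multiple explanations:

1. "The Verdict" was not part of the pretraining dataset when OpenAI trained GPT-2. Hence, the model is not explicitly overfitting to the training set and performs similarly well on the training and validation set portions of "The Verdict". (The validation set loss is slightly lower than the training set loss, which is unusual in deep learning. However, this is likely due to random noise since the dataset is relatively small. In practice, if there is no overfitting, the training and validation set performances are expected to be roughly identical.)

2. "The Verdict" was part of GPT-2's training dataset. In this case, we can't tell whether the model is overfitting the training data because the validation set would have been used for training as well. To evaluate the degree of overfitting, we would need a new dataset generated after OpenAI finished training GPT-2 to make sure that it couldn't have been part of the pretraining.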
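
- To get a feel for how close the two losses are, they can be converted to perplexities. This is a small worked calculation, assuming the reported values are average per-token cross-entropy losses in natural-log units (the PyTorch default); the variable names are illustrative

```python
import math

# Loss values copied from the results above
train_loss = 3.754748503367106
val_loss = 3.559617757797241

# Perplexity is the exponential of the average cross-entropy loss
train_ppl = math.exp(train_loss)
val_ppl = math.exp(val_loss)

print(f"Training perplexity: {train_ppl:.1f}")
print(f"Validation perplexity: {val_ppl:.1f}")
print(f"Loss gap (train - val): {train_loss - val_loss:.3f}")
```

- Under this assumption, the perplexities are roughly 43 and 35, which matches the observation that the model performs similarly on both portions of the text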

The code below is a reproducible standalone example for this new notebook.

In [13]:
import tiktoken
import torch
from previous_chapters import GPTModel


GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}


torch.manual_seed(123)

tokenizer = tiktoken.get_encoding("gpt2")

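# Exercise 5.6: Trying larger models

- In the main chapter, we experimented with the smallest GPT-2 model, which has only 124M parameters
- The reason was to keep the resource requirements as low as possible
- However, you can easily experiment with larger models with minimal code changes
- For example, to load the 1558M model instead of the 124M model in chapter 5, we only have to change the following 2 lines of code: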

```python
settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")
model_name = "gpt2-small (124M)"
```

- The updated code becomes


```python
settings, params = download_and_load_gpt2(model_size="1558M", models_dir="gpt2")
model_name = "gpt2-xl (1558M)"
```
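
- The size labels in the model names can be sanity-checked from the architecture settings alone. The sketch below is an estimate under stated assumptions: a context length of 1024, `qkv_bias=True`, and an output layer that shares its weights with the token embedding, as in the original GPT-2 (an untied output head would add another `vocab_size * emb_dim` parameters). The function `estimate_gpt2_params` is a hypothetical helper, not part of the chapter code

```python
VOCAB_SIZE = 50257
CONTEXT_LENGTH = 1024

# Embedding dimension and layer count per model size
# (the number of heads does not change the parameter count)
model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48},
}


def estimate_gpt2_params(emb_dim, n_layers,
                         vocab_size=VOCAB_SIZE, context_length=CONTEXT_LENGTH):
    # Hypothetical helper; assumes tied input/output embeddings and QKV biases
    d = emb_dim
    embeddings = vocab_size * d + context_length * d
    attention = 3 * (d * d + d) + (d * d + d)             # Q, K, V + output projection
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)  # expand to 4d, project back
    layer_norms = 2 * 2 * d                               # two norms, scale and shift each
    per_block = attention + feed_forward + layer_norms
    final_norm = 2 * d
    return embeddings + n_layers * per_block + final_norm


for name, cfg in model_configs.items():
    n_params = estimate_gpt2_params(cfg["emb_dim"], cfg["n_layers"])
    print(f"{name}: {n_params:,} parameters")
```

- Under these assumptions, the estimates round to 124M, 355M, 774M, and 1558M, which also gives a rough sense of how much more memory the larger models require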

In [20]:
import tiktoken
import torch
from previous_chapters import GPTModel


GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}


tokenizer = tiktoken.get_encoding("gpt2")

In [21]:
from gpt_download import download_and_load_gpt2
from gpt_generate import load_weights_into_gpt


model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

model_name = "gpt2-xl (1558M)"
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])
NEW_CONFIG.update({"context_length": 1024, "qkv_bias": True})

gpt = GPTModel(NEW_CONFIG)
gpt.eval()

settings, params = download_and_load_gpt2(model_size="1558M", models_dir="gpt2")
load_weights_into_gpt(gpt, params)

File already exists and is up-to-date: gpt2/1558M/checkpoint
File already exists and is up-to-date: gpt2/1558M/encoder.json
File already exists and is up-to-date: gpt2/1558M/hparams.json
File already exists and is up-to-date: gpt2/1558M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/1558M/model.ckpt.index
File already exists and is up-to-date: gpt2/1558M/model.ckpt.meta
File already exists and is up-to-date: gpt2/1558M/vocab.bpe


In [22]:
from gpt_generate import generate, text_to_token_ids, token_ids_to_text

In [23]:
torch.manual_seed(123)

token_ids = generate(
    model=gpt,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=NEW_CONFIG["context_length"],
    top_k=50,
    temperature=1.5
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you toward finding an ideal life. You don't have to accept your current one at once, because if you do you'll never
