<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

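# Converting a From-Scratch GPT Architecture to Llama 2
- This notebook converts the original GPT architecture to Llama 2 step by step; note that GPT and GPT-2 share the same architecture
- Why not Llama 1 or Llama 3?
  - The Llama 1 architecture is similar to Llama 2, except that Llama 2 has a larger context window; the Llama 1 weights are not readily available and come with more usage restrictions, so Llama 2 is the more sensible target
  - As for Llama 3, a separate notebook converts Llama 2 to Llama 3 (only a few small additional changes are needed)
- The explanations in this notebook are purposefully kept minimal to avoid unnecessary bloat and to focus on the main code
- For more information, see the Llama 2 paper: [Llama 2: Open Foundation and Fine-Tuned Chat Models (2023)](https://arxiv.org/abs/2307.09288)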

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/gpt2-to-llama2-llama3.webp?1">

- Packages used in this notebook:

In [1]:
from importlib.metadata import version

pkgs = [
    "huggingface_hub",  # to download the pretrained weights
    "sentencepiece",    # to implement the tokenizer
    "torch",            # to implement the model
]
for p in pkgs:
    print(f"{p} version: {version(p)}")

huggingface_hub version: 0.27.1
sentencepiece version: 0.2.0
torch version: 2.5.0+cu121


&nbsp;
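# 1. Convert the GPT model implementation step by step

- In this section, we go through the GPT model code from [chapter 4](../../ch04/01_main-chapter-code/ch04.ipynb) and modify it step by step to implement the Llama 2 architecture
- Afterwards, we load the original Llama 2 weights shared by Meta AI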


&nbsp;
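## 1.1 Replace LayerNorm with RMSNorm layer

- First, we replace LayerNorm with Root Mean Square Layer Normalization (RMSNorm)
- LayerNorm normalizes inputs using the mean and variance, whereas RMSNorm uses only the root mean square, which improves computational efficiency
- The RMSNorm operation is as follows, where $x$ is the input, $\gamma$ is a trainable parameter (vector), and $\epsilon$ is a small constant that avoids division by zero:

$$y_i = \frac{x_i}{\text{RMS}(x)} \gamma_i, \quad \text{where} \quad \text{RMS}(x) = \sqrt{\epsilon + \frac{1}{n} \sum_{i=1}^{n} x_i^2}$$

- For more details, see the paper [Root Mean Square Layer Normalization (2019)](https://arxiv.org/abs/1910.07467)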
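
As a small worked example of the formula above, the following sketch normalizes a toy 4-element vector by hand and contrasts it with LayerNorm-style normalization. The input values and variable names are illustrative assumptions, not part of the notebook's model code.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0])  # toy input, n = 4 (illustrative values)
gamma = torch.ones(4)                    # scale vector, initialized to 1
eps = 1e-5

# RMS(x) = sqrt(eps + mean(x_i^2)) = sqrt(1e-5 + 30 / 4), roughly 2.7386
rms = torch.sqrt(eps + x.pow(2).mean())
y = x / rms * gamma

# For contrast, LayerNorm-style normalization also subtracts the mean
layer_normed = (x - x.mean()) / torch.sqrt(x.var(unbiased=False) + eps)

print(rms)
print(y)             # mean of y**2 is close to 1, but the mean of y is not 0
print(layer_normed)  # mean is 0
```

RMSNorm only rescales the input, so it skips the mean computation and has no shift parameter, which is where the efficiency gain comes from.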


In [2]:
import torch
import torch.nn as nn

In [3]:
#####################################
# Chapter 4
#####################################

# class LayerNorm(nn.Module):
#     def __init__(self, emb_dim: int) -> None:
#         super().__init__()
#         self.eps = 1e-5
#         self.scale = nn.Parameter(torch.ones(emb_dim))
#         self.shift = nn.Parameter(torch.zeros(emb_dim))

#     def forward(self, x: torch.Tensor) -> torch.Tensor:
#         mean = x.mean(dim=-1, keepdim=True)
#         var = x.var(dim=-1, keepdim=True, unbiased=False)
#         norm_x = (x - mean) / torch.sqrt(var + self.eps)
#         return self.scale * norm_x + self.shift


class RMSNorm(nn.Module):
    def __init__(self, emb_dim: int, eps: float = 1e-5) -> None:
        super().__init__()
        self.eps = eps
        self.emb_dim = emb_dim
        # Learnable per-feature scale (gamma); unlike LayerNorm, there is no shift
        self.weight = nn.Parameter(torch.ones(emb_dim)).float()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Mean of squares over the feature dimension
        means = x.pow(2).mean(dim=-1, keepdim=True)
        # rsqrt(means + eps) equals 1 / RMS(x)
        x_normed = x * torch.rsqrt(means + self.eps)
        return (x_normed * self.weight).to(dtype=x.dtype)

- The following code cell checks that this implementation matches PyTorch's built-in implementation:

In [4]:
torch.manual_seed(123)

example_batch = torch.randn(2, 3, 4)  # random tensor of shape (2, 3, 4)

rms_norm = RMSNorm(emb_dim=example_batch.shape[-1])  # our implementation
rmsnorm_pytorch = torch.nn.RMSNorm(example_batch.shape[-1], eps=1e-5)  # PyTorch's built-in implementation

assert torch.allclose(rms_norm(example_batch), rmsnorm_pytorch(example_batch))  # both outputs should match

&nbsp;
## 1.2 Replace GELU with SiLU activation

- Llama uses the SiLU activation function (instead of GELU), which is also known as the Swish function:

$$
\text{silu}(x) = x \cdot \sigma(x), \quad \text{where} \quad \sigma(x) \text{ is the logistic sigmoid function.}
$$

- For more information, see the SiLU paper: [Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning (2017)](https://arxiv.org/abs/1702.03118)
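
To get a feel for the shape of this function, the sketch below evaluates the formula on a few scalar inputs using only the standard library. The helper name `silu` and the sample points are illustrative choices, not taken from the notebook.

```python
import math


def silu(x: float) -> float:
    """SiLU / Swish: x * sigmoid(x), written as x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))


# Sample points chosen for illustration
for value in (-10.0, -1.278, 0.0, 1.0, 10.0):
    print(f"silu({value:+.3f}) = {silu(value):+.4f}")
```

For large positive inputs SiLU stays close to the identity, for large negative inputs it approaches 0, and in between it dips slightly below zero (to roughly -0.28 near $x \approx -1.28$).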

In [5]:
#####################################
# Chapter 4
#####################################

# class GELU(nn.Module):
#     def __init__(self) -> None:
#         super().__init__()

#     def forward(self, x: torch.Tensor) -> torch.Tensor:
#         return 0.5 * x * (1 + torch.tanh(
#             torch.sqrt(torch.tensor(2.0 / torch.pi)) *
#             (x + 0.044715 * torch.pow(x, 3))
#         ))


class SiLU(nn.Module):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(x)

In [6]:
silu = SiLU()

assert torch.allclose(silu(example_batch), torch.nn.functional.silu(example_batch))

&nbsp;
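## 1.3 Update the FeedForward module

- In fact, Llama uses a "gated linear unit" (GLU) variant of SiLU called SwiGLU, which results in a slightly differently structured `FeedForward` module
- SwiGLU uses a gating mechanism in the feedforward layer, with the formula:

$$\text{SwiGLU}(x) = \text{SiLU}(\text{Linear}_1(x)) * \text{Linear}_2(x)$$

- Here, $\text{Linear}_1$ and $\text{Linear}_2$ are two linear layers, and $*$ denotes element-wise multiplication
- The third linear layer, $\text{Linear}_3$, is applied after this gated activation

- For more information, see the SwiGLU paper: [GLU Variants Improve Transformer (2020)](https://arxiv.org/abs/2002.05202)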
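
The formula can be traced step by step on toy tensors. The following is a minimal sketch that uses plain weight matrices in place of the three linear layers; the sizes `emb_dim=4` and `hidden_dim=6` and the names `w1`, `w2`, `w3` are hypothetical and chosen only to make the shapes easy to follow.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)

emb_dim, hidden_dim = 4, 6          # hypothetical toy sizes
x = torch.randn(2, 3, emb_dim)      # (batch, tokens, emb_dim)

# Weight matrices standing in for Linear_1, Linear_2, Linear_3 (no bias)
w1 = torch.randn(hidden_dim, emb_dim)
w2 = torch.randn(hidden_dim, emb_dim)
w3 = torch.randn(emb_dim, hidden_dim)

gate = F.silu(F.linear(x, w1))      # SiLU(Linear_1(x))
value = F.linear(x, w2)             # Linear_2(x)
hidden = gate * value               # element-wise product
out = F.linear(hidden, w3)          # Linear_3 applied after the gating

print(hidden.shape)  # torch.Size([2, 3, 6])
print(out.shape)     # torch.Size([2, 3, 4])
```

Both branches read the same input, so the gate decides per hidden unit how much of the linear branch passes through before the final projection back to the embedding size.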


In [None]:
#####################################
# Chapter 4
#####################################
# class FeedForward(nn.Module):
#     def __init__(self, cfg) -> None:
#         super().__init__()
#         self.layers = nn.Sequential(
#             nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
#             GELU(),
#             nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
#         )

#     def forward(self, x: torch.Tensor) -> torch.Tensor:
#         return self.layers(x)

In [7]:
class FeedForward(nn.Module):
    def __init__(self, cfg) -> None:
        super().__init__()
        self.fc1 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
        self.fc2 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
        self.fc3 = nn.Linear(cfg["hidden_dim"], cfg["emb_dim"], dtype=cfg["dtype"], bias=False)
        self.silu = SiLU()

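    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_fc1 = self.fc1(x)  # gate branch, passed through SiLU below
        x_fc2 = self.fc2(x)  # linear branch
        x = self.silu(x_fc1) * x_fc2  # element-wise gating (SwiGLU)
        return self.fc3(x)  # project back to emb_dim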

&nbsp;
## 1.4 Implement RoPE

- In the GPT model, the positional embeddings are implemented as follows:

```python
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
```

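- Unlike traditional absolute positional embeddings, Llama uses rotary position embeddings (RoPE), which enable it to capture both absolute and relative positional information
- The reference paper for RoPE is [RoFormer: Enhanced Transformer with Rotary Position Embedding (2021)](https://arxiv.org/abs/2104.09864)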
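
To illustrate the relative-position property, here is a toy 2D sketch, not the notebook's RoPE implementation. It rotates a query and a key vector by an angle proportional to their position and shows that the dot product depends only on the position offset. The function names, the vectors, and `theta=0.1` are illustrative assumptions; actual RoPE applies such rotations to each pair of embedding dimensions, with a different frequency per pair.

```python
import math


def rotate_2d(vec, position, theta=0.1):
    """Rotate a 2D vector by the angle position * theta (toy frequency)."""
    angle = position * theta
    x, y = vec
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
    )


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


q = (1.0, 2.0)   # toy query vector
k = (0.5, -1.0)  # toy key vector

score_a = dot(rotate_2d(q, 5), rotate_2d(k, 3))     # positions 5 and 3, offset 2
score_b = dot(rotate_2d(q, 12), rotate_2d(k, 10))   # positions 12 and 10, offset 2
score_c = dot(rotate_2d(q, 5), rotate_2d(k, 4))     # positions 5 and 4, offset 1

print(score_a, score_b, score_c)  # first two match; the third differs
```

Each vector is rotated according to its own absolute position, yet the attention score between two tokens is determined by how far apart they are.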
