<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Converting a From-Scratch GPT Architecture to Llama 2

- In this notebook, we convert the original GPT architecture into a Llama 2 model step by step (note that GPT and GPT-2 share the same architecture)
- Why not Llama 1 or Llama 3?
   - The Llama 1 architecture is similar to Llama 2, except that Llama 2 has a larger context window (which is nice); the Llama 1 weights are not readily available and have more usage restrictions, so it makes more sense to focus on Llama 2
   - Regarding Llama 3, I will share a separate notebook to convert Llama 2 to Llama 3 (there are only a few small additional changes)
- The explanations are purposefully kept minimal in this notebook so as not to bloat it unnecessarily and to keep the focus on the main code
- For more information, please see the Llama 2 paper: [Llama 2: Open Foundation and Fine-Tuned Chat Models (2023)](https://arxiv.org/abs/2307.09288)

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/GPT-to-llama/gpt2-to-llama2-llama3.webp?1">

- Packages that are being used in this notebook:

In [1]:
from importlib.metadata import version

pkgs = [
    "huggingface_hub",  # to download pretrained weights
    "sentencepiece",    # to implement the tokenizer
    "torch",            # to implement the model
]
for p in pkgs:
    print(f"{p} version: {version(p)}")

huggingface_hub version: 0.24.7
sentencepiece version: 0.2.0
torch version: 2.4.1+cu121


&nbsp;
# 1. Convert the GPT model implementation step by step

- In this section, we go through the GPT model code from [chapter 4](../../ch04/01_main-chapter-code/ch04.ipynb) and modify it step by step to implement the Llama 2 architecture
- Later, we load the original Llama 2 weights shared by Meta AI

&nbsp;
## 1.1 Replace LayerNorm with RMSNorm layer

- First, we replace LayerNorm with Root Mean Square Layer Normalization (RMSNorm)
- LayerNorm normalizes inputs using the mean and variance, while RMSNorm uses only the root mean square, which avoids computing the mean and improves computational efficiency
- The RMSNorm operation is as follows, where $x$ is the input, $\gamma$ is a trainable parameter (vector), and $\epsilon$ is a small constant to avoid zero-division errors:

$$y_i = \frac{x_i}{\text{RMS}(x)} \gamma_i, \quad \text{where} \quad \text{RMS}(x) = \sqrt{\epsilon + \frac{1}{n} \sum x_i^2}$$

- For more details, please see the paper [Root Mean Square Layer Normalization (2019)](https://arxiv.org/abs/1910.07467)
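As a quick sanity check of the formula above, the following is a minimal plain-Python sketch (no PyTorch required) that normalizes a single vector. The function name `rms_norm_vector` and the toy input are illustrative assumptions and are not part of the notebook's code:

```python
import math


def rms_norm_vector(x, gamma, eps=1e-5):
    # RMS(x) = sqrt(eps + mean(x_i^2)), as in the formula above
    mean_square = sum(v * v for v in x) / len(x)
    rms = math.sqrt(eps + mean_square)
    # Scale each normalized element by its trainable weight gamma_i
    return [v / rms * g for v, g in zip(x, gamma)]


x = [1.0, 2.0, 3.0, 4.0]
gamma = [1.0] * len(x)  # scale weights initialized to ones
y = rms_norm_vector(x, gamma)

# After normalization, the mean of the squared outputs is close to 1
mean_square_out = sum(v * v for v in y) / len(y)
print([round(v, 4) for v in y], round(mean_square_out, 4))
```

Note that, unlike LayerNorm, nothing is subtracted from the input and there is no shift parameter; only the scale changes.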

In [2]:
import torch
import torch.nn as nn


#####################################
# Chapter 4
#####################################

# class LayerNorm(nn.Module):
#     def __init__(self, emb_dim):
#         super().__init__()
#         self.eps = 1e-5
#         self.scale = nn.Parameter(torch.ones(emb_dim))
#         self.shift = nn.Parameter(torch.zeros(emb_dim))

#     def forward(self, x):
#         mean = x.mean(dim=-1, keepdim=True)
#         var = x.var(dim=-1, keepdim=True, unbiased=False)
#         norm_x = (x - mean) / torch.sqrt(var + self.eps)
#         return self.scale * norm_x + self.shift


class RMSNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.emb_dim = emb_dim
        self.weight = nn.Parameter(torch.ones(emb_dim)).float()

    def forward(self, x):
        means = x.pow(2).mean(dim=-1, keepdim=True)
        x_normed = x * torch.rsqrt(means + self.eps)
        return (x_normed * self.weight).to(dtype=x.dtype)

- The following code cell checks that this implementation works the same as PyTorch's built-in implementation:

In [3]:
torch.manual_seed(123)

example_batch = torch.randn(2, 3, 4)

rms_norm = RMSNorm(emb_dim=example_batch.shape[-1])
rmsnorm_pytorch = torch.nn.RMSNorm(example_batch.shape[-1], eps=1e-5)

assert torch.allclose(rms_norm(example_batch), rmsnorm_pytorch(example_batch))

&nbsp;
## 1.2 Replace GELU with SiLU activation

- Llama uses the SiLU activation function (instead of GELU), which is also known as the Swish function:

$$
\text{silu}(x) = x \cdot \sigma(x), \quad \text{where} \quad \sigma(x) \text{ is the logistic sigmoid.}
$$

- For more information, see the SiLU paper: [Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning (2017)](https://arxiv.org/abs/1702.03118)
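To get a feel for the shape of the function, here is a small plain-Python sketch that evaluates the formula above at a few points. The helper name `silu_scalar` is a hypothetical name chosen for this illustration:

```python
import math


def silu_scalar(x):
    # silu(x) = x * sigmoid(x), with sigmoid(x) = 1 / (1 + exp(-x))
    return x / (1.0 + math.exp(-x))


for value in (-2.0, 0.0, 2.0):
    print(f"silu({value:+.1f}) = {silu_scalar(value):+.4f}")
```

Unlike ReLU, SiLU lets small negative values through (about -0.24 at $x = -2$) and approaches the identity for large positive inputs.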

In [4]:
#####################################
# Chapter 4
#####################################

# class GELU(nn.Module):
#     def __init__(self):
#         super().__init__()

#     def forward(self, x):
#         return 0.5 * x * (1 + torch.tanh(
#             torch.sqrt(torch.tensor(2.0 / torch.pi)) *
#             (x + 0.044715 * torch.pow(x, 3))
#         ))


class SiLU(nn.Module):
    def __init__(self):
        super(SiLU, self).__init__()

    def forward(self, x):
        return x * torch.sigmoid(x)

In [5]:
silu = SiLU()

assert torch.allclose(silu(example_batch), torch.nn.functional.silu(example_batch))

&nbsp;
## 1.3 Update the FeedForward module

- In fact, Llama uses a "Gated Linear Unit" (GLU) variant of SiLU called SwiGLU, which essentially results in a slightly differently structured `FeedForward` module
- SwiGLU uses a gating mechanism in the feedforward layer, with the formula:

$$\text{SwiGLU}(x) = \text{SiLU}(\text{Linear}_1(x)) * (\text{Linear}_2(x))$$

- Here, $\text{Linear}_1$ and $\text{Linear}_2$ are two linear layers, and $*$ denotes element-wise multiplication
- The third linear layer, $\text{Linear}_3$, is applied after the gated activation

- For more information, see the SwiGLU paper: [GLU Variants Improve Transformer (2020)](https://arxiv.org/abs/2002.05202)
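The data flow described above can be traced by hand on a toy example. The following plain-Python sketch assumes bias-free linear layers stored as nested lists with shape (out_features, in_features); the helper names and the toy weights (embedding size 2, hidden size 3) are made up for illustration:

```python
import math


def silu_scalar(x):
    return x / (1.0 + math.exp(-x))


def linear(x, weight):
    # Bias-free linear layer: one dot product per output row
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def swiglu_feed_forward(x, w1, w2, w3):
    gate = [silu_scalar(v) for v in linear(x, w1)]  # SiLU(Linear_1(x))
    value = linear(x, w2)                           # Linear_2(x)
    hidden = [g * v for g, v in zip(gate, value)]   # element-wise product
    return linear(hidden, w3)                       # Linear_3 maps back to emb_dim


# Hypothetical toy weights: emb_dim=2, hidden_dim=3
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w2 = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
w3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

out = swiglu_feed_forward([1.0, 2.0], w1, w2, w3)
print([round(v, 4) for v in out])
```

The first two layers both read the same input and produce hidden-sized vectors; only the third layer maps the gated result back to the embedding size.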

In [6]:
#####################################
# Chapter 4
#####################################
# class FeedForward(nn.Module):
#     def __init__(self, cfg):
#         super().__init__()
#         self.layers = nn.Sequential(
#             nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
#             GELU(),
#             nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
#         )

#     def forward(self, x):
#         return self.layers(x)

In [7]:
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.fc1 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
        self.fc2 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
        self.fc3 = nn.Linear(cfg["hidden_dim"], cfg["emb_dim"], dtype=cfg["dtype"], bias=False)
        self.silu = SiLU()

    def forward(self, x):
        x_fc1 = self.fc1(x)
        x_fc2 = self.fc2(x)
        x = self.silu(x_fc1) * x_fc2
        return self.fc3(x)

- Note that we also added a `dtype=cfg["dtype"]` setting above, which will allow us to load the model directly in lower precision formats later to reduce memory usage (versus instantiating it in the original 32-bit precision format and then converting it)
- We also set `bias=False` since Llama doesn't use any bias units

&nbsp;
## 1.4 Implement RoPE

- In the GPT model, the positional embeddings are implemented as follows:

```python
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
```

- Unlike traditional absolute positional embeddings, Llama uses rotary position embeddings (RoPE), which enable it to capture both absolute and relative positional information simultaneously
- The reference paper for RoPE is [RoFormer: Enhanced Transformer with Rotary Position Embedding (2021)](https://arxiv.org/abs/2104.09864)
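The relative-position property mentioned above can be checked numerically. The sketch below is a plain-Python illustration for a single head vector, assuming the split-halves layout in which element `i` is paired with element `i + head_dim // 2`; the helper names and toy vectors are hypothetical:

```python
import math


def rope_vector(vec, pos, theta_base=10_000):
    # Rotate each pair (vec[i], vec[i + half]) by an angle that grows with `pos`
    head_dim = len(vec)
    half = head_dim // 2
    out = [0.0] * head_dim
    for i in range(half):
        angle = pos * theta_base ** (-2 * i / head_dim)
        cos, sin = math.cos(angle), math.sin(angle)
        out[i] = vec[i] * cos - vec[i + half] * sin
        out[i + half] = vec[i + half] * cos + vec[i] * sin
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

# Same relative offset (2 positions), different absolute positions
score_near = dot(rope_vector(q, 3), rope_vector(k, 1))
score_far = dot(rope_vector(q, 10), rope_vector(k, 8))
print(f"{score_near:.6f} {score_far:.6f}")
```

Both scores agree up to floating-point error because the query-key dot product after rotation depends only on the distance between the two positions, while each rotated vector on its own still encodes its absolute position.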

In [8]:
def precompute_rope_params(head_dim, theta_base=10_000, context_length=4096):
    assert head_dim % 2 == 0, "Embedding dimension must be even"

    # Compute the inverse frequencies
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))

    # Generate position indices
    positions = torch.arange(context_length)

    # Compute the angles
    angles = positions[:, None] * inv_freq[None, :]  # Shape: (context_length, head_dim // 2)

    # Expand angles to match the head_dim
    angles = torch.cat([angles, angles], dim=1)  # Shape: (context_length, head_dim)

    # Precompute sine and cosine
    cos = torch.cos(angles)
    sin = torch.sin(angles)

    return cos, sin

def compute_rope(x, cos, sin):
    # x: (batch_size, num_heads, seq_len, head_dim)
    batch_size, num_heads, seq_len, head_dim = x.shape
    assert head_dim % 2 == 0, "Head dimension must be even"

    # Split x into first half and second half
    x1 = x[..., : head_dim // 2]  # First half
    x2 = x[..., head_dim // 2 :]  # Second half

    # Adjust sin and cos shapes
    cos = cos[:seq_len, :].unsqueeze(0).unsqueeze(0)  # Shape: (1, 1, seq_len, head_dim)
    sin = sin[:seq_len, :].unsqueeze(0).unsqueeze(0)

    # Apply the rotary transformation
    rotated = torch.cat((-x2, x1), dim=-1)
    x_rotated = (x * cos) + (rotated * sin)

    return x_rotated.to(dtype=x.dtype)

- The following is an example of applying RoPE to the `q` and `k` tensors:

In [9]:
# Settings
batch_size = 2
context_len = 5
num_heads = 4
head_dim = 16

# Instantiate RoPE parameters
cos, sin = precompute_rope_params(head_dim=head_dim, context_length=context_len)

# Dummy query and key tensors
torch.manual_seed(123)
queries = torch.randn(batch_size, num_heads, context_len, head_dim)
keys = torch.randn(batch_size, num_heads, context_len, head_dim)

# Apply rotary position embeddings
queries_rot = compute_rope(queries, cos, sin)
keys_rot = compute_rope(keys, cos, sin)

&nbsp;
## 1.5 Add RoPE to MultiHeadAttention module

- It's important to note that GPT applies the positional embeddings to the inputs, whereas Llama applies rotations to the query and key vectors in the self-attention mechanism itself
- Here, we modify the `MultiHeadAttention` class with the appropriate RoPE code
- In addition, we remove the `qkv_bias` option and hardcode the `bias=False` setting
- Also, we add a dtype setting to be able to instantiate the model with a lower precision later
 - Tip: since the `TransformerBlock`s (in the next section) are repeated exactly, we could simplify the code and only initialize the buffers once instead of for each `MultiHeadAttention` module; however, we add the precomputed RoPE parameters to the `MultiHeadAttention` class so that it can function as a standalone module
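To put the tip above into numbers, here is a rough, illustrative calculation of how many buffer elements the per-module registration creates. It assumes the Llama 2 7B settings used later in this notebook (context length 4096, embedding dimension 4096, 32 heads, 32 layers), that each `MultiHeadAttention` module registers one causal mask plus the `cos` and `sin` tables, and that these buffers are stored as float32; the variable names are hypothetical:

```python
context_length = 4096
emb_dim = 4096
n_heads = 32
n_layers = 32

head_dim = emb_dim // n_heads  # 128

mask_elems = context_length * context_length  # causal mask
rope_elems = 2 * context_length * head_dim    # cos and sin tables
per_module = mask_elems + rope_elems

duplicated = n_layers * per_module  # one copy per MultiHeadAttention module
shared = per_module                 # a single copy shared by all modules

bytes_per_elem = 4  # assumption: float32 buffers
print(f"Per module: {per_module:,} elements")
print(f"All {n_layers} modules: {duplicated:,} elements "
      f"({duplicated * bytes_per_elem / 1024**3:.2f} GB)")
print(f"Shared once: {shared:,} elements "
      f"({shared * bytes_per_elem / 1024**3:.2f} GB)")
```

Under these assumptions, most of the duplicated memory comes from the repeated causal mask rather than from the RoPE tables, so sharing the buffers is a worthwhile optimization when memory is tight.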

In [10]:
#####################################
# Chapter 3
#####################################
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, num_heads, dtype=None):  # ,dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by n_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim

        ################################### NEW ###################################
        # Set bias=False and dtype=dtype for all linear layers below
        ###########################################################################
        self.W_query = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.W_key = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.W_value = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.out_proj = nn.Linear(d_out, d_out, bias=False, dtype=dtype)  # Linear layer to combine head outputs
        # self.dropout = nn.Dropout(dropout)
        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))

        ################################### NEW ###################################
        cos, sin = precompute_rope_params(head_dim=self.head_dim, context_length=context_length)
        self.register_buffer("cos", cos)
        self.register_buffer("sin", sin)
        ###########################################################################


    def forward(self, x):

        b, num_tokens, d_in = x.shape

        keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        ################################### NEW ###################################
        keys = compute_rope(keys, self.cos, self.sin)
        queries = compute_rope(queries, self.cos, self.sin)
        ###########################################################################

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        # attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.reshape(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # optional projection

        return context_vec

- Below is an example using the `MultiHeadAttention` module on an example input:

In [11]:
# Settings
batch_size = 1
context_len = 100
max_context_len = 4096
embed_dim = 128
num_heads = 4


example_batch = torch.randn((batch_size, context_len, embed_dim))

mha = MultiHeadAttention(
    d_in=embed_dim,
    d_out=embed_dim,
    context_length=max_context_len,
    num_heads=num_heads
)

mha(example_batch)

del mha  # delete to free up memory

&nbsp;
## 1.6 Update the TransformerBlock module

- At this stage, most of the hard work is already done; we can now update the `TransformerBlock` to use the code we implemented above
- This means we
 - replace LayerNorm with RMSNorm
 - remove dropout
 - remove the `qkv_bias` setting
 - add the `dtype` setting

In [12]:
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dtype=cfg["dtype"]  # NEW
            # dropout=cfg["drop_rate"],
            # qkv_bias=cfg["qkv_bias"]
        )
        self.ff = FeedForward(cfg)

        ################################### NEW ###################################
        # self.norm1 = LayerNorm(cfg["emb_dim"])
        # self.norm2 = LayerNorm(cfg["emb_dim"])
        self.norm1 = RMSNorm(cfg["emb_dim"])
        self.norm2 = RMSNorm(cfg["emb_dim"])
        ###########################################################################

        # self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)   # Shape [batch_size, num_tokens, emb_size]
        # x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed-forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        # x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        return x

&nbsp;
## 1.7 Update the model class

- As you may recall from [chapter 5](../01_main-chapter-code/ch05.ipynb), the `TransformerBlock` is a repeated block within the main model
- Our Llama model is almost complete; we just have to update the model code surrounding the `TransformerBlock`
- This means we
  - remove absolute positional embeddings since we have RoPE embeddings now
  - replace LayerNorm with RMSNorm
  - remove dropout
  - add the dtype setting

In [13]:
# class GPTModel(nn.Module):
class Llama2Model(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"], dtype=cfg["dtype"])
        # self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        # self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        ################################### NEW ###################################
        # self.final_norm = LayerNorm(cfg["emb_dim"])
        self.final_norm = RMSNorm(cfg["emb_dim"])
        ###########################################################################
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False, dtype=cfg["dtype"])

    def forward(self, in_idx):
        # batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        # pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds  # + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        # x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

&nbsp;
## 2. Initialize model

- The model code is now complete, and we are ready to initialize it
- In [chapter 5](../01_main-chapter-code/ch05.ipynb), we used the following config file to specify the 124M-parameter GPT model:

In [14]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}

- For reference, the 1.5B parameter GPT model config is shown below as well:

In [15]:
GPT_CONFIG_1558M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 1600,         # Embedding dimension
    "n_heads": 25,           # Number of attention heads
    "n_layers": 48,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}

- Similarly, we can define a Llama 2 config file for the 7B model (we ignore the other larger models for simplicity here):

In [16]:
LLAMA2_CONFIG_7B = {
    "vocab_size": 32000,     # Vocabulary size
    "context_length": 4096,  # Context length
    "emb_dim": 4096,         # Embedding dimension
    "n_heads": 32,           # Number of attention heads
    "n_layers": 32,          # Number of layers
    "hidden_dim": 11008,     # NEW: Size of the intermediate dimension in FeedForward
    "dtype": torch.bfloat16  # NEW: Lower-precision dtype to reduce memory usage
}

- Using these settings, we can now initialize a Llama 2 7B model (note that this requires ~26 GB of memory)

In [17]:
model = Llama2Model(LLAMA2_CONFIG_7B)

In [18]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

Total number of parameters: 6,738,415,616
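The total printed above can also be reproduced from the configuration values alone, which is a useful cross-check that every layer has the expected shape. The following sketch uses only the `LLAMA2_CONFIG_7B` values and the layer structure defined earlier in this notebook (no bias terms anywhere); the variable names are illustrative:

```python
cfg = {
    "vocab_size": 32000,
    "emb_dim": 4096,
    "n_layers": 32,
    "hidden_dim": 11008,
}

d = cfg["emb_dim"]

embedding = cfg["vocab_size"] * d         # tok_emb
attention = 4 * d * d                     # W_query, W_key, W_value, out_proj
feed_forward = 3 * d * cfg["hidden_dim"]  # fc1, fc2, fc3
norms = 2 * d                             # norm1 and norm2 weights
per_block = attention + feed_forward + norms

final_norm = d
out_head = d * cfg["vocab_size"]

total = embedding + cfg["n_layers"] * per_block + final_norm + out_head
print(f"Parameters per transformer block: {per_block:,}")
print(f"Total number of parameters: {total:,}")
```

Roughly two thirds of each block's parameters sit in the three feed-forward matrices, with the remaining third in the four attention projections.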


- As shown above, the model contains 6.7 billion parameters (commonly rounded and referred to as a 7B model)
- Additionally, we can compute the memory requirements for this model using the code below:

In [19]:
def model_memory_size(model, input_dtype=torch.float32):
    total_params = 0
    total_grads = 0
    for param in model.parameters():
        # Calculate total number of elements per parameter
        param_size = param.numel()
        total_params += param_size
        # Check if gradients are stored for this parameter
        if param.requires_grad:
            total_grads += param_size

    # Calculate buffer size (non-parameters that require memory)
    total_buffers = sum(buf.numel() for buf in model.buffers())

    # Size in bytes = (Number of elements) * (Size of each element in bytes)
    # We assume parameters and gradients are stored in the same type as input dtype
    element_size = torch.tensor(0, dtype=input_dtype).element_size()
    total_memory_bytes = (total_params + total_grads + total_buffers) * element_size

    # Convert bytes to gigabytes
    total_memory_gb = total_memory_bytes / (1024**3)

    return total_memory_gb

print(f"float32 (PyTorch default): {model_memory_size(model, input_dtype=torch.float32):.2f} GB")
print(f"bfloat16: {model_memory_size(model, input_dtype=torch.bfloat16):.2f} GB")

float32 (PyTorch default): 52.33 GB
bfloat16: 26.17 GB
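Note that `model_memory_size` counts gradients and buffers in addition to the weights. As a rough cross-check, the sketch below redoes the arithmetic for the weights alone and for weights plus gradients, taking the parameter count printed earlier in this notebook as given. Buffers are left out here, which is why the second number comes out slightly below the values reported above; the helper name `to_gb` is hypothetical:

```python
n_params = 6_738_415_616  # parameter count printed earlier in this notebook


def to_gb(n_elements, bytes_per_element):
    # Same conversion as in model_memory_size: bytes divided by 1024**3
    return n_elements * bytes_per_element / (1024 ** 3)


for name, n_bytes in (("float32", 4), ("bfloat16", 2)):
    weights_only = to_gb(n_params, n_bytes)
    with_grads = to_gb(2 * n_params, n_bytes)  # every parameter requires a gradient
    print(f"{name}: weights {weights_only:.2f} GB, "
          f"weights + gradients {with_grads:.2f} GB")
```

For inference only, where no gradients are kept, the bfloat16 weights by themselves therefore need roughly half of the ~26 GB estimate.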


- Lastly, we can also transfer the model to an NVIDIA or Apple Silicon GPU if applicable:

In [20]:
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model.to(device);

&nbsp;
## 3. Load tokenizer

- In this section, we are going to load the tokenizer for the model
- Llama 2 uses Google's [SentencePiece](https://github.com/google/sentencepiece) tokenizer instead of OpenAI's [Tiktoken](https://github.com/openai/tiktoken) (but Llama 3 uses Tiktoken)
- Meta AI shared the original Llama 2 model weights and tokenizer vocabulary on the Hugging Face Hub
- We will download the tokenizer vocabulary from the Hub and load it into SentencePiece
- Uncomment and run the following code to install the required libraries:

In [21]:
# !pip install huggingface_hub sentencepiece

- Please note that Meta AI requires that you accept the Llama 2 licensing terms before you can download the files; to do this, you have to create a Hugging Face Hub account and visit the [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b) repository to accept the terms
- Next, you will need to create an access token; to generate an access token with READ permissions, click on the profile picture in the upper right and click on "Settings"


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/GPT-to-llama/settings.webp?1" width="300px">

- Then, create and copy the access token so you can copy & paste it into the next code cell

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/GPT-to-llama/access-token.webp?1" width="600px">

In [22]:
from huggingface_hub import login
import json

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    access_token = config["HF_ACCESS_TOKEN"]

login(token=access_token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
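The login cell above expects a `config.json` file in the working directory with an `HF_ACCESS_TOKEN` entry. A minimal sketch of that file's layout is shown below; it writes the file to a temporary directory and reads it back the same way. The token value is a placeholder, not a real credential:

```python
import json
import os
import tempfile

placeholder_config = {"HF_ACCESS_TOKEN": "hf_your_token_here"}  # placeholder value

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.json")

    # Write the file in the layout the login cell reads
    with open(config_path, "w") as config_file:
        json.dump(placeholder_config, config_file, indent=2)

    # Read it back exactly as in the login cell
    with open(config_path, "r") as config_file:
        loaded_token = json.load(config_file)["HF_ACCESS_TOKEN"]

print(loaded_token == placeholder_config["HF_ACCESS_TOKEN"])
```

Keeping the token in a separate file that is excluded from version control avoids hardcoding it into the notebook.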


- After login via the access token, which is necessary to verify that we accepted the Llama 2 licensing terms, we can now download the tokenizer vocabulary:

In [23]:
from huggingface_hub import hf_hub_download

tokenizer_file = hf_hub_download(
    repo_id="meta-llama/Llama-2-7b",
    filename="tokenizer.model",
    local_dir="Llama-2-7b"
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

- To provide a more familiar interface for the tokenizer, we define a small `LlamaTokenizer` wrapper class:

In [24]:
import sentencepiece as spm


class LlamaTokenizer:
    def __init__(self, tokenizer_file):
        sp = spm.SentencePieceProcessor()
        sp.load(tokenizer_file)
        self.tokenizer = sp

    def encode(self, text):
        return self.tokenizer.encode_as_ids(text)

    def decode(self, ids):
        return self.tokenizer.decode_ids(ids)  # the input is a list of token IDs, not string pieces


tokenizer = LlamaTokenizer(tokenizer_file)

- We can now use the `generate` function to have the Llama 2 model generate new text:

In [25]:
from previous_chapters import generate, text_to_token_ids, token_ids_to_text
# If the `previous_chapters.py` file is not available locally,
# you can import it from the `llms-from-scratch` PyPI package.
# For details, see: https://github.com/rasbt/LLMs-from-scratch/tree/main/pkg
# E.g.,
# from llms_from_scratch.ch05 import generate, text_to_token_ids, token_ids_to_text



torch.manual_seed(123)

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves", tokenizer).to(device),
    max_new_tokens=30,
    context_size=LLAMA2_CONFIG_7B["context_length"],
    top_k=1,
    temperature=0.
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort movesαllRadius deletingpretcc否']; future eer napulate lackус während inter DES издаSchéon로жа Bass differencespadxsnu ;; ctx始


- Of course, as we can see above, the text is nonsensical since we haven't trained the Llama 2 model yet
- In the next section, instead of training it ourselves, which would cost tens to hundreds of thousands of dollars, we load the pretrained weights from Meta AI

&nbsp;
## 4. Load pretrained weights

- We are loading the ["meta-llama/Llama-2-7b"](https://huggingface.co/meta-llama/Llama-2-7b) base model below, which is a simple text completion model before finetuning
- Alternatively, you can load the instruction-finetuned and aligned ["meta-llama/Llama-2-7b-chat"](https://huggingface.co/meta-llama/Llama-2-7b-chat) model by modifying the string in the next code cell accordingly

In [26]:
weights_file = hf_hub_download(
   repo_id="meta-llama/Llama-2-7b",
   filename="consolidated.00.pth",
   local_dir="Llama-2-7b"
)

consolidated.00.pth:   0%|          | 0.00/13.5G [00:00<?, ?B/s]

In [27]:
weights = torch.load(weights_file, weights_only=True)

- The `weights` dictionary contains the following tensors (only the first 15 are shown for simplicity):

In [28]:
list(weights.keys())[:15]

['tok_embeddings.weight',
 'norm.weight',
 'output.weight',
 'layers.0.attention.wq.weight',
 'layers.0.attention.wk.weight',
 'layers.0.attention.wv.weight',
 'layers.0.attention.wo.weight',
 'layers.0.feed_forward.w1.weight',
 'layers.0.feed_forward.w2.weight',
 'layers.0.feed_forward.w3.weight',
 'layers.0.attention_norm.weight',
 'layers.0.ffn_norm.weight',
 'layers.1.attention.wq.weight',
 'layers.1.attention.wk.weight',
 'layers.1.attention.wv.weight']

- The following function, modeled after the `load_weights_into_gpt` function in [chapter 5](../01_main-chapter-code/ch05.ipynb), loads the pretrained weights into our Llama 2 model:

In [29]:
def assign(left, right):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")

    if isinstance(right, torch.Tensor):
        return torch.nn.Parameter(right.clone().detach())
    else:
        return torch.nn.Parameter(torch.tensor(right))


def load_weights_into_llama(model, param_config, params):
    model.tok_emb.weight = assign(model.tok_emb.weight, params["tok_embeddings.weight"])

    for l in range(param_config["n_layers"]):

        # Load attention weights
        model.trf_blocks[l].att.W_query.weight = assign(
            model.trf_blocks[l].att.W_query.weight,
            params[f"layers.{l}.attention.wq.weight"]
        )
        model.trf_blocks[l].att.W_key.weight = assign(
            model.trf_blocks[l].att.W_key.weight,
            params[f"layers.{l}.attention.wk.weight"]
        )
        model.trf_blocks[l].att.W_value.weight = assign(
            model.trf_blocks[l].att.W_value.weight,
            params[f"layers.{l}.attention.wv.weight"]
        )
        model.trf_blocks[l].att.out_proj.weight = assign(
            model.trf_blocks[l].att.out_proj.weight,
            params[f"layers.{l}.attention.wo.weight"]
        )
        model.trf_blocks[l].norm1.weight = assign(
            model.trf_blocks[l].norm1.weight,
            params[f"layers.{l}.attention_norm.weight"]
        )

        # Load FeedForward weights
        model.trf_blocks[l].ff.fc1.weight = assign(
            model.trf_blocks[l].ff.fc1.weight,
            params[f"layers.{l}.feed_forward.w1.weight"]
        )
        # Note: w2 and w3 are swapped relative to fc2 and fc3; in the weights file,
        # w3 is the second input projection (fc2) and w2 is the output projection (fc3)
        model.trf_blocks[l].ff.fc2.weight = assign(
            model.trf_blocks[l].ff.fc2.weight,
            params[f"layers.{l}.feed_forward.w3.weight"]
        )
        model.trf_blocks[l].ff.fc3.weight = assign(
            model.trf_blocks[l].ff.fc3.weight,
            params[f"layers.{l}.feed_forward.w2.weight"]
        )
        model.trf_blocks[l].norm2.weight = assign(
            model.trf_blocks[l].norm2.weight,
            params[f"layers.{l}.ffn_norm.weight"]
        )

    # Load output layer weights
    model.final_norm.weight = assign(model.final_norm.weight, params["norm.weight"])
    model.out_head.weight = assign(model.out_head.weight, params["output.weight"])


load_weights_into_llama(model, LLAMA2_CONFIG_7B, weights)
model.to(device);
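The correspondence used by `load_weights_into_llama` above can be summarized as a lookup table from checkpoint keys to the parameter names of our `Llama2Model`. The sketch below is only an illustration of that mapping; the names `BLOCK_KEY_MAP`, `TOP_LEVEL_KEY_MAP`, and `map_checkpoint_key` are hypothetical helpers and are not used elsewhere in this notebook:

```python
import re

# Checkpoint key suffix -> parameter path inside each TransformerBlock
BLOCK_KEY_MAP = {
    "attention.wq.weight": "att.W_query.weight",
    "attention.wk.weight": "att.W_key.weight",
    "attention.wv.weight": "att.W_value.weight",
    "attention.wo.weight": "att.out_proj.weight",
    "attention_norm.weight": "norm1.weight",
    "feed_forward.w1.weight": "ff.fc1.weight",
    "feed_forward.w3.weight": "ff.fc2.weight",  # w3 is loaded into fc2
    "feed_forward.w2.weight": "ff.fc3.weight",  # w2 is loaded into fc3
    "ffn_norm.weight": "norm2.weight",
}

# Keys outside the repeated blocks
TOP_LEVEL_KEY_MAP = {
    "tok_embeddings.weight": "tok_emb.weight",
    "norm.weight": "final_norm.weight",
    "output.weight": "out_head.weight",
}


def map_checkpoint_key(key):
    # Translate a checkpoint key into the matching Llama2Model parameter name
    if key in TOP_LEVEL_KEY_MAP:
        return TOP_LEVEL_KEY_MAP[key]
    match = re.fullmatch(r"layers\.(\d+)\.(.+)", key)
    if match and match.group(2) in BLOCK_KEY_MAP:
        return f"trf_blocks.{match.group(1)}.{BLOCK_KEY_MAP[match.group(2)]}"
    raise KeyError(f"Unexpected checkpoint key: {key}")


print(map_checkpoint_key("layers.0.feed_forward.w3.weight"))
```

Writing the mapping out this way makes the `w2`/`w3` swap easy to spot, and the `KeyError` surfaces any checkpoint entries that the loading function does not handle.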

- Next, we are ready to use the model for text generation

In [30]:
torch.manual_seed(123)

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort", tokenizer).to(device),
    max_new_tokens=25,
    context_size=LLAMA2_CONFIG_7B["context_length"],
    top_k=1,
    temperature=0.
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort has been made to ensure that the information contained in this website is accurate and up to date and correct at the time of publication


&nbsp;
## 5. Using the instruction-finetuned model

- As mentioned earlier, above we used the pretrained base model; if you want to use a model capable of following instructions, use the `"meta-llama/Llama-2-7b-chat"` model instead, as shown below

In [34]:
del model  # to free up memory

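weights_file = hf_hub_download(
   repo_id="meta-llama/Llama-2-7b-chat",
   filename="consolidated.00.pth",
   local_dir="Llama-2-7b-chat"
)

# Load the downloaded chat checkpoint so that the chat weights
# (and not the previously loaded base weights) are assigned below
weights = torch.load(weights_file, weights_only=True)

model = Llama2Model(LLAMA2_CONFIG_7B)
load_weights_into_llama(model, LLAMA2_CONFIG_7B, weights)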
model.to(device);

torch.manual_seed(123)

token_ids = generate(
    model=model,
    idx=text_to_token_ids("What do llamas eat?", tokenizer).to(device),
    max_new_tokens=25,
    context_size=LLAMA2_CONFIG_7B["context_length"],
    top_k=1,
    temperature=0.
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

consolidated.00.pth:   0%|          | 0.00/13.5G [00:00<?, ?B/s]

Output text:
 What do llamas eat?
Llamas and alpacas are herbivores, which means they eat grasses, leaves, grass


&nbsp;
# What's next?

- This notebook converted the original GPT-2 architecture into a Llama 2 model
- If you are interested in how to convert Llama 2 into Llama 3, Llama 3.1, and Llama 3.2, check out the [converting-llama2-to-llama3.ipynb](converting-llama2-to-llama3.ipynb) notebook