<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Converting Llama 2 to Llama 3.2 From Scratch

- This is a follow-up notebook to [Converting a From-Scratch GPT Architecture to Llama 2](./converting-GPT-to-llama2.ipynb), converting Meta AI's Llama 2 architecture model step by step to Llama 3, Llama 3.1, and Llama 3.2
- The explanations are purposefully kept minimal in this notebook so as not to bloat it unnecessarily and to keep the focus on the main code
- For more information about the architectures, please see the Llama 2 and Llama 3 papers
 - [Llama 2: Open Foundation and Fine-Tuned Chat Models (2023)](https://arxiv.org/abs/2307.09288)
 - [The Llama 3 Herd of Models](https://arxiv.org/abs/2407.21783)

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/GPT-to-llama/gpt2-to-llama2-llama3.webp?1">

In [1]:
# pip install -r requirements-extra.txt

- Packages that are being used in this notebook:

In [2]:
from importlib.metadata import version

pkgs = [
    "blobfile",         # to download pretrained weights
    "huggingface_hub",  # to download pretrained weights
    "tiktoken",         # to implement the tokenizer
    "torch",            # to implement the model
]
for p in pkgs:
    print(f"{p} version: {version(p)}")

blobfile version: 3.0.0
huggingface_hub version: 0.24.7
tiktoken version: 0.8.0
torch version: 2.4.1+cu121


&nbsp;
# 1. Convert the Llama model implementation step by step

- If you are new to implementing LLM architectures, I recommend starting with [chapter 4](../../ch04/01_main-chapter-code/ch04.ipynb), which walks you through the implementation of the original GPT architecture step by step
- The [Converting a From-Scratch GPT Architecture to Llama 2](./converting-GPT-to-llama2.ipynb) notebook then implements the Llama-specific components, such as RMSNorm layers, SiLU and SwiGLU activations, RoPE (rotary position embeddings), and the SentencePiece tokenizer
- This notebook takes the Llama 2 architecture and transforms it into the Llama 3 architecture by
    1. modifying the rotary embeddings
    2. implementing grouped-query attention
    3. and using a customized version of the GPT-4 tokenizer
- Later, we then load the original Llama 3 weights shared by Meta AI into the architecture

&nbsp;
## 1.1 Reusing Llama 2 components

- Llama 2 is actually quite similar to Llama 3, as mentioned above and illustrated in the figure at the top of this notebook
- This means that we can import several building blocks from the [Llama 2 notebook](./converting-GPT-to-llama2.ipynb) using the following code

In [3]:
import os
import sys
import io
import nbformat
import types

def import_from_notebook():
    def import_definitions_from_notebook(fullname, names):
        current_dir = os.getcwd()
        path = os.path.join(current_dir, fullname + ".ipynb")
        path = os.path.normpath(path)

        # Load the notebook
        if not os.path.exists(path):
            raise FileNotFoundError(f"Notebook file not found at: {path}")

        with io.open(path, "r", encoding="utf-8") as f:
            nb = nbformat.read(f, as_version=4)

        # Create a module to store the imported functions and classes
        mod = types.ModuleType(fullname)
        sys.modules[fullname] = mod

        # Go through the notebook cells and only execute function or class definitions
        for cell in nb.cells:
            if cell.cell_type == "code":
                cell_code = cell.source
                for name in names:
                    # Check for function or class definitions
                    if f"def {name}" in cell_code or f"class {name}" in cell_code:
                        exec(cell_code, mod.__dict__)
        return mod

    fullname = "converting-gpt-to-llama2"
    names = ["precompute_rope_params", "compute_rope", "SiLU", "FeedForward", "RMSNorm", "MultiHeadAttention"]

    return import_definitions_from_notebook(fullname, names)

In [4]:
imported_module = import_from_notebook()

# We need to redefine precompute_rope_params
# precompute_rope_params = getattr(imported_module, "precompute_rope_params", None)
compute_rope = getattr(imported_module, "compute_rope", None)
SiLU = getattr(imported_module, "SiLU", None)
FeedForward = getattr(imported_module, "FeedForward", None)
RMSNorm = getattr(imported_module, "RMSNorm", None)

# MultiHeadAttention only for comparison purposes
MultiHeadAttention = getattr(imported_module, "MultiHeadAttention", None)

&nbsp;
## 1.2 Modified RoPE

- Llama 3 uses rotary position embeddings (RoPE) similar to Llama 2 (for a detailed explanation, please see the [RoPE paper](https://arxiv.org/abs/2104.09864))
- There are some subtle differences in the RoPE settings, though
 - Llama 3 now supports up to 8,192 tokens, twice as many as Llama 2 (4,096)
 - The base value for the so-called RoPE $\theta$ was increased from 10,000 (Llama 2) to 500,000 (Llama 3) in the following equation (adapted from the [RoPE paper](https://arxiv.org/abs/2104.09864))

$$\Theta = \left\{\theta_i = \text{base}^{\frac{-2(i-1)}{d}}, i \in \left[1, 2, ..., d/2\right]\right\}$$

- These $\theta$ values are a set of predefined parameters that are used to determine the rotational angles in the rotary matrix, where $d$ is the dimensionality of the embedding space
- Increasing the base from 10,000 to 500,000 makes the frequencies (or rotation angles) decay faster across the dimensions: $\theta_1 = 1$ for any base, but every higher dimension is associated with a smaller angle (a longer wavelength) than before, which stretches the positional signal over the longer context
- In addition, we introduce a `freq_config` section in the code below that adjusts the frequency; however, we won't be needing it in Llama 3 (only Llama 3.1 and Llama 3.2), so we will revisit this `freq_config` later (it's set to `None` and ignored by default)
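
- To see how the base affects the angles in the equation above, the following sketch evaluates $\theta_i$ for both bases in plain Python (no PyTorch needed). The helper name `rope_thetas` and the toy `head_dim=16` are illustrative assumptions, not part of this notebook's code:

```python
def rope_thetas(base, head_dim):
    # theta_i = base^(-2(i-1)/d) for i = 1, ..., d/2 (written with a 0-based index here)
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


thetas_llama2 = rope_thetas(10_000, head_dim=16)
thetas_llama3 = rope_thetas(500_000, head_dim=16)

for i, (t2, t3) in enumerate(zip(thetas_llama2, thetas_llama3)):
    print(f"i={i}: base 10k -> {t2:.2e} | base 500k -> {t3:.2e}")
```

- The first entry is 1.0 for both bases; every later entry is smaller with the larger base, so the corresponding wavelengths $2\pi/\theta_i$ are longer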

In [5]:
import torch

def precompute_rope_params(head_dim, theta_base=10_000, context_length=4096, freq_config=None):
    assert head_dim % 2 == 0, "Embedding dimension must be even"

    # Compute the inverse frequencies
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))

    ################################ NEW ###############################################
    # Frequency adjustments
    if freq_config is not None:
        low_freq_wavelen = freq_config["original_context_length"] / freq_config["low_freq_factor"]
        high_freq_wavelen = freq_config["original_context_length"] / freq_config["high_freq_factor"]

        wavelen = 2 * torch.pi / inv_freq

        inv_freq_llama = torch.where(
            wavelen > low_freq_wavelen, inv_freq / freq_config["factor"], inv_freq
        )

        smooth_factor = (freq_config["original_context_length"] / wavelen - freq_config["low_freq_factor"]) / (
            freq_config["high_freq_factor"] - freq_config["low_freq_factor"]
        )

        smoothed_inv_freq = (
            (1 - smooth_factor) * (inv_freq / freq_config["factor"]) + smooth_factor * inv_freq
        )

        is_medium_freq = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)
        inv_freq_llama = torch.where(is_medium_freq, smoothed_inv_freq, inv_freq_llama)
        inv_freq = inv_freq_llama
    ####################################################################################


    # Generate position indices
    positions = torch.arange(context_length)

    # Compute the angles
    angles = positions[:, None] * inv_freq[None, :]  # Shape: (context_length, head_dim // 2)

    # Expand angles to match the head_dim
    angles = torch.cat([angles, angles], dim=1)  # Shape: (context_length, head_dim)

    # Precompute sine and cosine
    cos = torch.cos(angles)
    sin = torch.sin(angles)

    return cos, sin

- To summarize, what's new so far for Llama 3 compared to Llama 2 are the context length and the theta base parameter:

In [6]:
# Instantiate RoPE parameters

llama_2_context_len = 4096
llama_3_context_len = 8192

llama_2_theta_base = 10_000
llama_3_theta_base = 500_000

- The usage remains the same as before in Llama 2:

In [7]:
# Settings
batch_size = 2
num_heads = 4
head_dim = 16

# Instantiate RoPE parameters
cos, sin = precompute_rope_params(
    head_dim=head_dim,
    theta_base=llama_3_theta_base,
    context_length=llama_3_context_len
)

# Dummy query and key tensors
torch.manual_seed(123)
queries = torch.randn(batch_size, num_heads, llama_3_context_len, head_dim)
keys = torch.randn(batch_size, num_heads, llama_3_context_len, head_dim)

# Apply rotary position embeddings
queries_rot = compute_rope(queries, cos, sin)
keys_rot = compute_rope(keys, cos, sin)

&nbsp;
## 1.3 Grouped-query attention

- In this section, we replace multi-head attention (MHA) with an alternative mechanism called grouped-query attention (GQA)
- In short, one can think of GQA as a more compute- and parameter-efficient version of MHA
- In GQA, we reduce the number of key and value projections by sharing them among multiple attention heads
- Each attention head still has its unique query, but these queries attend to the same group of keys and values
- Below is an illustration of GQA with 2 key-value-groups (kv-groups):

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/GPT-to-llama/grouped-query-attention.webp" width="500px">


- The main idea behind GQA is to reduce the number of unique query groups that attend to the key-value pairs, reducing the size of some of the matrix multiplications and the number of parameters in MHA without significantly reducing modeling performance
- The GQA code is very similar to MHA (I highlighted the changes below via the "NEW" sections)
- In short, the main change in GQA is that each kv-group needs to be repeated to match the number of heads it is associated with, as implemented below

- In addition, we also introduce a `SharedBuffers` class that will allow us to reuse the `mask`, `cos`, and `sin` tensors in the transformer blocks to improve efficiency (this will be crucial when working with models such as Llama 3.1 and 3.2 later, which support up to 131k input tokens)
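
- The sharing pattern described above can be sketched without PyTorch. This toy example assumes 4 heads and 2 kv-groups and uses strings as stand-ins for key tensors; the helper `repeat_interleave` is a hypothetical list-based analogue of the tensor method of the same name:

```python
def repeat_interleave(items, repeats):
    # Repeat each element `repeats` times consecutively: [a, b] -> [a, a, b, b]
    return [item for item in items for _ in range(repeats)]


num_heads = 4
num_kv_groups = 2
group_size = num_heads // num_kv_groups  # number of query heads per kv-group

kv_groups = ["K1", "K2"]
keys_per_head = repeat_interleave(kv_groups, group_size)

for head, key in enumerate(keys_per_head):
    # Query head `head` attends to kv-group number `head // group_size`
    print(f"query head {head} -> {key}")
```

- In other words, query head `h` uses kv-group `h // group_size`, which is why neighboring heads share the same keys and values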

In [8]:
import torch.nn as nn


############################# NEW  #############################
class SharedBuffers:
    _buffers = {}

    @staticmethod
    def get_buffers(context_length, head_dim, rope_base, freq_config, dtype=torch.float32):
        key = (context_length, head_dim, rope_base, tuple(freq_config.values()) if freq_config else freq_config, dtype)

        if key not in SharedBuffers._buffers:
            # Create or fetch the buffers
            mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
            cos, sin = precompute_rope_params(head_dim, rope_base, context_length, freq_config)
            if dtype is not None:
                cos = cos.to(dtype)
                sin = sin.to(dtype)
            SharedBuffers._buffers[key] = (mask, cos, sin)

        return SharedBuffers._buffers[key]
############################# NEW  #############################


class GroupedQueryAttention(nn.Module):
    def __init__(
            self, d_in, d_out, context_length, num_heads,
            num_kv_groups,       # NEW
            rope_base=10_000,    # NEW
            rope_config=None,    # NEW
            dtype=None
        ):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        assert num_heads % num_kv_groups == 0, "num_heads must be divisible by num_kv_groups"  # NEW

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        ############################# NEW  #############################
        # self.W_key = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        # self.W_value = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.W_key = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.W_value = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.num_kv_groups = num_kv_groups
        self.group_size = num_heads // num_kv_groups
        ################################################################

        self.W_query = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.out_proj = nn.Linear(d_out, d_out, bias=False, dtype=dtype)

        ############################# NEW  #############################
        # Fetch buffers using SharedBuffers
        mask, cos, sin = SharedBuffers.get_buffers(context_length, self.head_dim, rope_base, rope_config, dtype)
        ############################# NEW  #############################
        
        self.register_buffer("mask", mask)
        self.register_buffer("cos", cos)
        self.register_buffer("sin", sin)

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        queries = self.W_query(x)  # Shape: (b, num_tokens, d_out)
        keys = self.W_key(x)  # Shape: (b, num_tokens, num_kv_groups * head_dim)
        values = self.W_value(x)  # Shape: (b, num_tokens, num_kv_groups * head_dim)

        # Reshape queries, keys, and values
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        ##################### NEW  #####################
        # keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        # values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        keys = keys.view(b, num_tokens, self.num_kv_groups, self.head_dim)
        values = values.view(b, num_tokens, self.num_kv_groups, self.head_dim)
        ################################################

        # Transpose keys, values, and queries
        keys = keys.transpose(1, 2)  # Shape: (b, num_kv_groups, num_tokens, head_dim)
        values = values.transpose(1, 2)  # Shape: (b, num_kv_groups, num_tokens, head_dim)
        queries = queries.transpose(1, 2)  # Shape: (b, num_heads, num_tokens, head_dim)

        # Apply RoPE
        keys = compute_rope(keys, self.cos, self.sin)
        queries = compute_rope(queries, self.cos, self.sin)

        ##################### NEW  #####################
        # Expand keys and values to match the number of heads
        # Shape: (b, num_heads, num_tokens, head_dim)

        keys = keys.repeat_interleave(self.group_size, dim=1)  # Shape: (b, num_heads, num_tokens, head_dim)
        values = values.repeat_interleave(self.group_size, dim=1)  # Shape: (b, num_heads, num_tokens, head_dim)
        # For example, before repeat_interleave along dim=1 (kv-groups):
        #   [K1, K2]
        # After repeat_interleave (each kv-group is repeated group_size times):
        #   [K1, K1, K2, K2]
        # If we used regular repeat instead of repeat_interleave, we'd get:
        #   [K1, K2, K1, K2]
        ################################################

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        # Shape: (b, num_heads, num_tokens, num_tokens)
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        assert keys.shape[-1] == self.head_dim

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.reshape(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # optional projection

        return context_vec

- To illustrate the parameter savings, consider the following multi-head attention example from the GPT and Llama 2 code:

In [9]:
# Settings
batch_size = 1
context_len = 3000
max_context_len = 8192
embed_dim = 4096
num_heads = 32


example_batch = torch.randn((batch_size, context_len, embed_dim))

mha = MultiHeadAttention(
    d_in=embed_dim,
    d_out=embed_dim,
    context_length=max_context_len,
    num_heads=num_heads
)

mha(example_batch)

print("W_key:", mha.W_key.weight.shape)
print("W_value:", mha.W_value.weight.shape)
print("W_query:", mha.W_query.weight.shape)

W_key: torch.Size([4096, 4096])
W_value: torch.Size([4096, 4096])
W_query: torch.Size([4096, 4096])


- Now, if we use grouped-query attention instead, with 8 kv-groups (that's how many Llama 3 8B uses), we can see that the number of rows of the key and value matrices is reduced by a factor of 4 (because 32 attention heads divided by 8 kv-groups is 4)

In [10]:
gqa = GroupedQueryAttention(
    d_in=embed_dim,
    d_out=embed_dim,
    context_length=max_context_len,
    num_heads=num_heads,
    num_kv_groups=8,
    rope_base=llama_3_theta_base
)

gqa(example_batch)

print("W_key:", gqa.W_key.weight.shape)
print("W_value:", gqa.W_value.weight.shape)
print("W_query:", gqa.W_query.weight.shape)

W_key: torch.Size([1024, 4096])
W_value: torch.Size([1024, 4096])
W_query: torch.Size([4096, 4096])


- As a side note, to make the GroupedQueryAttention equivalent to standard multi-head attention, you can set the number of kv-groups (`num_kv_groups`) equal to the number of heads (`num_heads`)
- Lastly, let's compare the number of parameters below:

In [11]:
print("Total number of parameters:")

mha_total_params = sum(p.numel() for p in mha.parameters())
print(f"MHA: {mha_total_params:,}")

gqa_total_params = sum(p.numel() for p in gqa.parameters())
print(f"GQA: {gqa_total_params:,}")

Total number of parameters:
MHA: 67,108,864
GQA: 41,943,040
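
- These totals can be reproduced by hand, assuming that all four projections (`W_query`, `W_key`, `W_value`, `out_proj`) are bias-free linear layers, as in the `GroupedQueryAttention` class above; the variable names below are illustrative:

```python
embed_dim = 4096
num_heads = 32
num_kv_groups = 8
head_dim = embed_dim // num_heads  # 128

# MHA: query, key, value, and output projections are all embed_dim x embed_dim
mha_params = 4 * embed_dim * embed_dim

# GQA: query and output projections are unchanged; the key and value
# projections shrink to (num_kv_groups * head_dim) x embed_dim
kv_dim = num_kv_groups * head_dim  # 1024
gqa_params = 2 * embed_dim * embed_dim + 2 * kv_dim * embed_dim

print(f"MHA: {mha_params:,}")
print(f"GQA: {gqa_params:,}")
```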


In [12]:
# Free up memory:
del mha
del gqa

&nbsp;
## 1.4 Update the TransformerBlock module

- Next, we update the `TransformerBlock`
- Here, we simply swap `MultiHeadAttention` with `GroupedQueryAttention` and add the new RoPE settings

In [13]:
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att =  GroupedQueryAttention(  # MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            num_kv_groups=cfg["n_kv_groups"],  # NEW
            rope_base=cfg["rope_base"],        # NEW
            rope_config=cfg["rope_freq"],      # NEW
            dtype=cfg["dtype"]
        )
        self.ff = FeedForward(cfg)
        self.norm1 = RMSNorm(cfg["emb_dim"], eps=1e-5)
        self.norm2 = RMSNorm(cfg["emb_dim"], eps=1e-5)

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x.to(torch.bfloat16))   # Shape [batch_size, num_tokens, emb_size]
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed-forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x.to(torch.bfloat16))
        x = x + shortcut  # Add the original input back

        return x

&nbsp;
## 1.5 Defining the model class

- When setting up the model class, we fortunately don't have to do much; we just update the name to `Llama3Model`

In [14]:
# class Llama2Model(nn.Module):
class Llama3Model(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"], dtype=cfg["dtype"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = RMSNorm(cfg["emb_dim"], eps=1e-5)
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False, dtype=cfg["dtype"])

    def forward(self, in_idx):
        tok_embeds = self.tok_emb(in_idx)
        x = tok_embeds
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x.to(torch.bfloat16))
        return logits

&nbsp;
## 2. Initialize model

- Now we can define a Llama 3 config file (the Llama 2 config file is shown for comparison)

In [15]:
LLAMA2_CONFIG_7B = {
    "vocab_size": 32_000,    # Vocabulary size
    "context_length": 4096,  # Context length
    "emb_dim": 4096,         # Embedding dimension
    "n_heads": 32,           # Number of attention heads
    "n_layers": 32,          # Number of layers
    "hidden_dim": 11_008,    # Size of the intermediate dimension in FeedForward
    "dtype": torch.bfloat16  # Lower-precision dtype to reduce memory usage
}

In [16]:
LLAMA3_CONFIG_8B = {
    "vocab_size": 128_256,   # NEW: Larger vocabulary size
    "context_length": 8192,  # NEW: Larger context length
    "emb_dim": 4096,         # Embedding dimension
    "n_heads": 32,           # Number of attention heads
    "n_layers": 32,          # Number of layers
    "hidden_dim": 14_336,    # NEW: Larger size of the intermediate dimension in FeedForward
    "n_kv_groups": 8,        # NEW: Key-Value groups for grouped-query attention
    "rope_base": 500_000.0,  # NEW: The base in RoPE's "theta" was increased to 500_000
    "rope_freq": None,       # NEW: Additional configuration for adjusting the RoPE frequencies
    "dtype": torch.bfloat16  # Lower-precision dtype to reduce memory usage
}

- Using these settings, we can now initialize a Llama 3 8B model
- Note that this requires ~34 GB of memory (for comparison, Llama 2 7B required ~26 GB of memory)

In [17]:
model = Llama3Model(LLAMA3_CONFIG_8B)

- The following is expected to print True to confirm buffers are reused instead of being (wastefully) recreated:

In [None]:
# Check buffers
print(model.trf_blocks[0].att.mask is model.trf_blocks[-1].att.mask)
print(model.trf_blocks[0].att.cos is model.trf_blocks[-1].att.cos)
print(model.trf_blocks[0].att.sin is model.trf_blocks[-1].att.sin) 
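
- The identity checks above can pass because `SharedBuffers.get_buffers` stores its result in a class-level dictionary keyed by the configuration. Below is a minimal sketch of that caching pattern, assuming a hypothetical `BufferCache` class and plain lists in place of tensors:

```python
class BufferCache:
    _cache = {}  # class-level dictionary, shared by all callers

    @staticmethod
    def get(context_length, head_dim):
        key = (context_length, head_dim)
        if key not in BufferCache._cache:
            # Stand-in for the expensive mask/cos/sin computation
            BufferCache._cache[key] = [0.0] * (context_length * head_dim)
        return BufferCache._cache[key]


first = BufferCache.get(8, 4)
second = BufferCache.get(8, 4)   # same settings: the cached object is returned
other = BufferCache.get(16, 4)   # different settings: a new object is created

print(first is second)
print(first is other)
```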

- Let's now also compute the number of trainable parameters:

In [18]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

Total number of parameters: 8,030,261,248
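
- This total can be reproduced from the config values, assuming (as the weight-loading code later in this notebook suggests) that each block holds one attention module, three bias-free feed-forward layers, and two RMSNorm weight vectors; the variable names are illustrative:

```python
vocab_size, emb_dim, hidden_dim = 128_256, 4096, 14_336
n_layers, n_heads, n_kv_groups = 32, 32, 8
head_dim = emb_dim // n_heads

embedding = vocab_size * emb_dim   # tok_emb
out_head = emb_dim * vocab_size    # output projection (counted separately here)

# Per transformer block
attention = 2 * emb_dim * emb_dim + 2 * (n_kv_groups * head_dim) * emb_dim
feed_forward = 3 * emb_dim * hidden_dim   # fc1, fc2, fc3
norms = 2 * emb_dim                       # norm1, norm2
per_block = attention + feed_forward + norms

final_norm = emb_dim
total = embedding + n_layers * per_block + final_norm + out_head
print(f"Total number of parameters: {total:,}")
```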


- As shown above, the model contains 8 billion parameters
- Additionally, we can compute the memory requirements for this model using the code below:

In [19]:
def model_memory_size(model, input_dtype=torch.float32):
    total_params = 0
    total_grads = 0
    for param in model.parameters():
        # Calculate total number of elements per parameter
        param_size = param.numel()
        total_params += param_size
        # Check if gradients are stored for this parameter
        if param.requires_grad:
            total_grads += param_size

    # Calculate buffer size (non-parameters that require memory)
    total_buffers = sum(buf.numel() for buf in model.buffers())

    # Size in bytes = (Number of elements) * (Size of each element in bytes)
    # We assume parameters and gradients are stored in the same type as input dtype
    element_size = torch.tensor(0, dtype=input_dtype).element_size()
    total_memory_bytes = (total_params + total_grads + total_buffers) * element_size

    # Convert bytes to gigabytes
    total_memory_gb = total_memory_bytes / (1024**3)

    return total_memory_gb

print(f"float32 (PyTorch default): {model_memory_size(model, input_dtype=torch.float32):.2f} GB")
print(f"bfloat16: {model_memory_size(model, input_dtype=torch.bfloat16):.2f} GB")

float32 (PyTorch default): 68.08 GB
bfloat16: 34.04 GB


- Lastly, we can also transfer the model to an NVIDIA or Apple Silicon GPU if applicable:

In [20]:
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model.to(device);

&nbsp;
## 3. Load tokenizer

- In this section, we are going to load the tokenizer for the model
- Llama 2 used Google's [SentencePiece](https://github.com/google/sentencepiece) tokenizer instead of OpenAI's BPE tokenizer based on the [Tiktoken](https://github.com/openai/tiktoken) library
- Llama 3, however, reverted to using the BPE tokenizer from Tiktoken; specifically, it uses the GPT-4 tokenizer with an extended vocabulary
- You can find the original Tiktoken-adaptation by Meta AI [here](https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py) in their official Llama 3 repository
- Below, I rewrote the tokenizer code to make it more readable and minimal for this notebook (but the behavior should be similar)

In [21]:
from pathlib import Path

import tiktoken
from tiktoken.load import load_tiktoken_bpe


class Tokenizer:
    def __init__(self, model_path):
        assert os.path.isfile(model_path), f"Model file {model_path} not found"
        mergeable_ranks = load_tiktoken_bpe(model_path)

        self.special_tokens = {
            "<|begin_of_text|>": 128000,
            "<|end_of_text|>": 128001,
            "<|start_header_id|>": 128006,
            "<|end_header_id|>": 128007,
            "<|eot_id|>": 128009,
        }
        self.special_tokens.update({
            f"<|reserved_{i}|>": 128002 + i for i in range(256) if (128002 + i) not in self.special_tokens.values()
        })

        self.model = tiktoken.Encoding(
            name=Path(model_path).name,
            pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
            mergeable_ranks=mergeable_ranks,
            special_tokens=self.special_tokens
        )


    def encode(self, text, bos=False, eos=False, allowed_special=set(), disallowed_special=()):
        if bos:
            tokens = [self.special_tokens["<|begin_of_text|>"]]
        else:
            tokens = []

        tokens += self.model.encode(text, allowed_special=allowed_special, disallowed_special=disallowed_special)

        if eos:
            tokens.append(self.special_tokens["<|end_of_text|>"])
        return tokens

    def decode(self, tokens):
        return self.model.decode(tokens)
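
- The reserved-token comprehension in `__init__` is easy to misread, so here is the same logic in isolation (plain Python, no tokenizer file needed). It fills the ID range starting at 128002 while skipping IDs that are already taken by the named special tokens; the variable names `taken` and `reserved` are illustrative:

```python
special_tokens = {
    "<|begin_of_text|>": 128000,
    "<|end_of_text|>": 128001,
    "<|start_header_id|>": 128006,
    "<|end_header_id|>": 128007,
    "<|eot_id|>": 128009,
}
taken = set(special_tokens.values())

# Same rule as in the Tokenizer class: skip IDs that already have a name
reserved = {
    f"<|reserved_{i}|>": 128002 + i
    for i in range(256)
    if (128002 + i) not in taken
}
special_tokens.update(reserved)

print(len(reserved))                       # number of reserved entries added
print(special_tokens["<|reserved_0|>"])    # first reserved ID
print("<|reserved_4|>" in special_tokens)  # ID 128006 already has a name
```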

- Meta AI shared the original Llama 3 model weights and tokenizer vocabulary on the Hugging Face Hub
- We will first download the tokenizer vocabulary from the Hub and load it into the code above

- Please note that Meta AI requires that you accept the Llama 3 licensing terms before you can download the files; to do this, you have to create a Hugging Face Hub account and visit the [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) repository to accept the terms
- Next, you will need to create an access token; to generate an access token with READ permissions, click on the profile picture in the upper right and click on "Settings"


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/GPT-to-llama/settings.webp?1" width="300px">

- Then, create and copy the access token so you can paste it into the next code cell

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/GPT-to-llama/access-token.webp?1" width="600px">

In [22]:
from huggingface_hub import login
import json

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    access_token = config["HF_ACCESS_TOKEN"]

login(token=access_token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


- After logging in via the access token, which is necessary to verify that we accepted the Llama 3 licensing terms, we can now download the tokenizer vocabulary:

In [23]:
from huggingface_hub import hf_hub_download

tokenizer_file_path = hf_hub_download(
    repo_id="meta-llama/Meta-Llama-3-8B",
    filename="original/tokenizer.model",
    local_dir="Llama-3-8B"
)

- Note that for using Llama 3 files, we may need the `blobfile` package, which is used when handling datasets or models stored in cloud storage solutions like Google Cloud Storage (GCS), Azure Blob Storage, or Amazon S3
- You can install this dependency by uncommenting and executing the `pip` command below


In [24]:
# pip install blobfile

In [25]:
tokenizer = Tokenizer(tokenizer_file_path)

- We can now use the `generate` function to have the Llama 3 model generate new text:

In [26]:
from previous_chapters import generate, text_to_token_ids, token_ids_to_text
# If the `previous_chapters.py` file is not available locally,
# you can import it from the `llms-from-scratch` PyPI package.
# For details, see: https://github.com/rasbt/LLMs-from-scratch/tree/main/pkg
# E.g.,
# from llms_from_scratch.ch05 import generate, text_to_token_ids, token_ids_to_text


torch.manual_seed(123)

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort", tokenizer).to(device),
    max_new_tokens=30,
    context_size=LLAMA3_CONFIG_8B["context_length"],
    top_k=1,
    temperature=0.
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort_dead aeros Ingredients başında.extensionégor clangmissions güc như submodule.and report官方%，.Reader(",");
ामल ندار Parliamentary !!! HigginsDynamicZhgmt writeln Globalsletion 사진------


- Of course, as we can see above, the text is nonsensical since we haven't trained the Llama 3 model yet
- In the next section, instead of training it ourselves, which would cost tens to hundreds of thousands of dollars, we load the pretrained weights from Meta AI

&nbsp;
## 4. Load pretrained weights

- We are loading the ["meta-llama/Meta-Llama-3-8B"](https://huggingface.co/meta-llama/Meta-Llama-3-8B) base model below, which is a simple text completion model before finetuning
- Alternatively, you can load the instruction-finetuned and aligned ["meta-llama/Meta-Llama-3-8B-Instruct"](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model by modifying the string in the next code cell accordingly
- Combined, the weight files are about 16 GB in size

In [27]:
from safetensors.torch import load_file

combined_weights = {}

for i in range(1, 5):
    weights_file = hf_hub_download(
        repo_id="meta-llama/Meta-Llama-3-8B",
        filename=f"model-0000{i}-of-00004.safetensors",
        local_dir="Llama-3-8B"
    )
    current_weights = load_file(weights_file)
    combined_weights.update(current_weights)

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

- The `combined_weights` dictionary contains the following tensors (only the first 15 are shown for simplicity):

In [28]:
list(combined_weights.keys())[:15]

['model.embed_tokens.weight',
 'model.layers.0.input_layernorm.weight',
 'model.layers.0.mlp.down_proj.weight',
 'model.layers.0.mlp.gate_proj.weight',
 'model.layers.0.mlp.up_proj.weight',
 'model.layers.0.post_attention_layernorm.weight',
 'model.layers.0.self_attn.k_proj.weight',
 'model.layers.0.self_attn.o_proj.weight',
 'model.layers.0.self_attn.q_proj.weight',
 'model.layers.0.self_attn.v_proj.weight',
 'model.layers.1.input_layernorm.weight',
 'model.layers.1.mlp.down_proj.weight',
 'model.layers.1.mlp.gate_proj.weight',
 'model.layers.1.mlp.up_proj.weight',
 'model.layers.1.post_attention_layernorm.weight']

- The following function, modeled after the `load_weights_into_gpt` function in [chapter 5](../01_main-chapter-code/ch05.ipynb), loads the pretrained weights into our Llama 3 model:

In [29]:
def assign(left, right, tensor_name="unknown"):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch in tensor '{tensor_name}'. Left: {left.shape}, Right: {right.shape}")

    if isinstance(right, torch.Tensor):
        return torch.nn.Parameter(right.clone().detach())
    else:
        return torch.nn.Parameter(torch.tensor(right))


def load_weights_into_llama(model, param_config, params):
    model.tok_emb.weight = assign(model.tok_emb.weight, params["model.embed_tokens.weight"], "model.embed_tokens.weight")

    for l in range(param_config["n_layers"]):

        # Load attention weights
        model.trf_blocks[l].att.W_query.weight = assign(
            model.trf_blocks[l].att.W_query.weight,
            params[f"model.layers.{l}.self_attn.q_proj.weight"],
            f"model.layers.{l}.self_attn.q_proj.weight"
        )
        model.trf_blocks[l].att.W_key.weight = assign(
            model.trf_blocks[l].att.W_key.weight,
            params[f"model.layers.{l}.self_attn.k_proj.weight"],
            f"model.layers.{l}.self_attn.k_proj.weight"
        )
        model.trf_blocks[l].att.W_value.weight = assign(
            model.trf_blocks[l].att.W_value.weight,
            params[f"model.layers.{l}.self_attn.v_proj.weight"],
            f"model.layers.{l}.self_attn.v_proj.weight"
        )
        model.trf_blocks[l].att.out_proj.weight = assign(
            model.trf_blocks[l].att.out_proj.weight,
            params[f"model.layers.{l}.self_attn.o_proj.weight"],
            f"model.layers.{l}.self_attn.o_proj.weight"
        )
        model.trf_blocks[l].norm1.weight = assign(
            model.trf_blocks[l].norm1.weight,
            params[f"model.layers.{l}.input_layernorm.weight"],
            f"model.layers.{l}.input_layernorm.weight"
        )

        # Load FeedForward weights
        model.trf_blocks[l].ff.fc1.weight = assign(
            model.trf_blocks[l].ff.fc1.weight,
            params[f"model.layers.{l}.mlp.gate_proj.weight"],
            f"model.layers.{l}.mlp.gate_proj.weight"
        )
        model.trf_blocks[l].ff.fc2.weight = assign(
            model.trf_blocks[l].ff.fc2.weight,
            params[f"model.layers.{l}.mlp.up_proj.weight"],
            f"model.layers.{l}.mlp.up_proj.weight"
        )
        model.trf_blocks[l].ff.fc3.weight = assign(
            model.trf_blocks[l].ff.fc3.weight,
            params[f"model.layers.{l}.mlp.down_proj.weight"],
            f"model.layers.{l}.mlp.down_proj.weight"
        )
        model.trf_blocks[l].norm2.weight = assign(
            model.trf_blocks[l].norm2.weight,
            params[f"model.layers.{l}.post_attention_layernorm.weight"],
            f"model.layers.{l}.post_attention_layernorm.weight"
        )

    # Load output layer weights
    model.final_norm.weight = assign(model.final_norm.weight, params["model.norm.weight"], "model.norm.weight")

    if "lm_head.weight" in params.keys():
        model.out_head.weight = assign(model.out_head.weight, params["lm_head.weight"], "lm_head.weight")
    else:
        model.out_head.weight = assign(model.out_head.weight, params["model.embed_tokens.weight"], "model.embed_tokens.weight")
        print("Model uses weight tying.")


load_weights_into_llama(model, LLAMA3_CONFIG_8B, combined_weights)
model.to(device);
del combined_weights  # free up memory
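
- As a rough sanity check before loading, the key names read by `load_weights_into_llama` can be enumerated and compared against a checkpoint dictionary; the sketch below is a minimal, torch-free illustration, and the helper names `expected_llama_keys` and `missing_keys` are hypothetical (they are not part of this notebook)

```python
def expected_llama_keys(n_layers, tied_embeddings=False):
    """Return the set of checkpoint keys that the loading function above reads."""
    per_layer = [
        "self_attn.q_proj.weight",
        "self_attn.k_proj.weight",
        "self_attn.v_proj.weight",
        "self_attn.o_proj.weight",
        "input_layernorm.weight",
        "mlp.gate_proj.weight",
        "mlp.up_proj.weight",
        "mlp.down_proj.weight",
        "post_attention_layernorm.weight",
    ]
    keys = {"model.embed_tokens.weight", "model.norm.weight"}
    if not tied_embeddings:
        # With weight tying, the output head reuses the embedding matrix
        keys.add("lm_head.weight")
    for l in range(n_layers):
        keys.update(f"model.layers.{l}.{suffix}" for suffix in per_layer)
    return keys


def missing_keys(params, n_layers, tied_embeddings=False):
    """List expected keys that are absent from a checkpoint dictionary."""
    return sorted(expected_llama_keys(n_layers, tied_embeddings) - set(params))


# Toy checkpoint with one layer and a deliberately missing tensor
toy_params = {key: None for key in expected_llama_keys(n_layers=1)}
del toy_params["model.layers.0.mlp.up_proj.weight"]

print("Missing keys:", missing_keys(toy_params, n_layers=1))
```

- Running such a check first would report every absent tensor at once, instead of stopping with a `KeyError` on the first one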

- Next, we are ready to use the model for text generation

In [30]:
torch.manual_seed(123)

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort", tokenizer).to(device),
    max_new_tokens=25,
    context_size=LLAMA3_CONFIG_8B["context_length"],
    top_k=1,
    temperature=0.
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort has been made to trace copyright holders and to obtain their permission for the use of copyright material. The publisher apologizes for any


&nbsp;
## 5. Using the instruction-finetuned model

- Above, we used the pretrained base model; if you want to use a model capable of following instructions, use the `"meta-llama/Meta-Llama-3-8B-Instruct"` model instead, as shown below

In [31]:
# to free up memory

import gc

del model

gc.collect()  # Run Python garbage collector

if torch.cuda.is_available():
    torch.cuda.empty_cache()

In [32]:
combined_weights = {}

for i in range(1, 5):
    weights_file = hf_hub_download(
        repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
        filename=f"model-0000{i}-of-00004.safetensors",
        local_dir="Llama-3-8B-Instruct"
    )
    current_weights = load_file(weights_file)
    combined_weights.update(current_weights)


model = Llama3Model(LLAMA3_CONFIG_8B)
load_weights_into_llama(model, LLAMA3_CONFIG_8B, combined_weights)
model.to(device)
del combined_weights  # free up memory

model-00001-of-00004.safetensors:  36%|###6      | 1.81G/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]
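
- The hardcoded pattern `model-0000{i}-of-00004.safetensors` works for this four-shard checkpoint but would not pad correctly for ten or more shards; the sketch below assumes the five-digit zero padding seen in the file names above, and `shard_filenames` is a hypothetical helper, not part of `huggingface_hub`

```python
def shard_filenames(num_shards, prefix="model", suffix="safetensors"):
    """Build shard file names, assuming five-digit zero padding."""
    return [
        f"{prefix}-{i:05d}-of-{num_shards:05d}.{suffix}"
        for i in range(1, num_shards + 1)
    ]


for name in shard_filenames(4):
    print(name)
```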

- Note that the Llama 3 model should ideally be used with the correct prompt template that was used during finetuning (as discussed in chapter 7)
- Below is a wrapper class around the tokenizer based on Meta AI's Llama 3-specific [ChatFormat code](https://github.com/meta-llama/llama3/blob/11817d47e1ba7a4959b025eb1ca308572e0e3963/llama/tokenizer.py#L202) that constructs the prompt template

In [33]:
class ChatFormat:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def encode_header(self, message):
        tokens = []
        tokens.append(self.tokenizer.special_tokens["<|start_header_id|>"])
        tokens.extend(self.tokenizer.encode(message["role"], bos=False, eos=False))
        tokens.append(self.tokenizer.special_tokens["<|end_header_id|>"])
        tokens.extend(self.tokenizer.encode("\n\n", bos=False, eos=False))
        return tokens

    def encode(self, text):
        message = {
            "role": "user",
            "content": text
        }

        tokens = self.encode_header(message)
        tokens.extend(
            self.tokenizer.encode(message["content"].strip(), bos=False, eos=False)
        )
        tokens.append(self.tokenizer.special_tokens["<|eot_id|>"])
        return tokens

    def decode(self, token_ids):
        return self.tokenizer.decode(token_ids)


chat_tokenizer = ChatFormat(tokenizer)

- The usage is as follows:

In [34]:
token_ids = chat_tokenizer.encode("Hello World!")
print(token_ids)

[128006, 882, 128007, 271, 9906, 4435, 0, 128009]


In [35]:
tokenizer.decode(token_ids)

'<|start_header_id|>user<|end_header_id|>\n\nHello World!<|eot_id|>'
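
- For reference, the string-level layout of a multi-turn Llama 3 prompt can be sketched as follows; this assumes the template of Meta AI's reference ChatFormat code (a `<|begin_of_text|>` token, one header per message, and a trailing assistant header), and `format_dialog` is a hypothetical helper that only builds text

```python
def format_dialog(messages, add_generation_prompt=True):
    """Lay out a list of {"role", "content"} messages as a Llama 3 prompt string."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Header with the role name, followed by two newlines
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        # Message body, terminated by the end-of-turn token
        text += message["content"].strip() + "<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so that the model continues from here
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


dialog = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What do llamas eat?"},
]
prompt = format_dialog(dialog)
print(prompt)
```

- Note that this only illustrates the layout; for actual tokenization, the special tokens should be mapped to their IDs directly, as the `ChatFormat` class above does, rather than encoded as ordinary text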

- Let's now see the Llama 3 instruction model in action:

In [36]:
torch.manual_seed(123)

token_ids = generate(
    model=model,
    idx=text_to_token_ids("What do llamas eat?", chat_tokenizer).to(device),
    max_new_tokens=150,
    context_size=LLAMA3_CONFIG_8B["context_length"],
    top_k=1,
    temperature=0.
)

output_text = token_ids_to_text(token_ids, tokenizer)


def clean_text(text, header_end="assistant<|end_header_id|>\n\n"):
    # Find the index of the first occurrence of the assistant header
    index = text.find(header_end)

    if index != -1:
        # Return the substring starting after the assistant header
        return text[index + len(header_end):].strip()  # Strip removes leading/trailing whitespace
    else:
        # If the header is not found, return the original text
        return text

print("Output text:\n", clean_text(output_text))

Output text:
 Llamas are herbivores, which means they primarily eat plants and plant-based foods. Here are some of the things llamas like to eat:

1. Grass: Llamas love to graze on grass, especially in the spring and summer months.
2. Hay: Hay is a staple in a llama's diet. They like to eat timothy hay, alfalfa hay, and other types of hay.
3. Grains: Llamas may also be fed grains like oats, barley, and corn. However, grains should not make up more than 10-15% of a llama's diet.
4. Fruits and vegetables: Llamas may enjoy fruits and vegetables as treats, such as


&nbsp;
# Llama 3.1 8B

- A few months after the initial Llama 3 release, Meta AI followed up with their Llama 3.1 suite of models (see the official [Introducing Llama 3.1: Our most capable models to date](https://ai.meta.com/blog/meta-llama-3-1/) announcement blog post for details)
- Conveniently, we can reuse our previous Llama 3 code from above to implement Llama 3.1 8B

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/GPT-to-llama/llama3-to-llama31.webp" width="700px">

- The architecture is identical, with the only change being a rescaling of the RoPE frequencies as indicated in the configuration file below



In [37]:
LLAMA3_CONFIG_8B = {
    "vocab_size": 128_256,   # Vocabulary size
    "context_length": 8192,  # Context length
    "emb_dim": 4096,         # Embedding dimension
    "n_heads": 32,           # Number of attention heads
    "n_layers": 32,          # Number of layers
    "hidden_dim": 14_336,    # Size of the intermediate dimension in FeedForward
    "n_kv_groups": 8,        # Key-Value groups for grouped-query attention
    "rope_base": 500_000.0,  # The base in RoPE's "theta"
    "rope_freq": None,       # Additional configuration for adjusting the RoPE frequencies
    "dtype": torch.bfloat16  # Lower-precision dtype to reduce memory usage
}

LLAMA31_CONFIG_8B = {
    "vocab_size": 128_256,      # Vocabulary size
    "context_length": 131_072,  # NEW: Larger supported context length
    "emb_dim": 4096,            # Embedding dimension
    "n_heads": 32,              # Number of attention heads
    "n_layers": 32,             # Number of layers
    "hidden_dim": 14_336,       # Size of the intermediate dimension in FeedForward
    "n_kv_groups": 8,           # Key-Value groups for grouped-query attention
    "rope_base": 500_000.0,     # The base in RoPE's "theta"
    "dtype": torch.bfloat16,    # Lower-precision dtype to reduce memory usage
    "rope_freq": {              # NEW: RoPE frequency scaling
        "factor": 8.0,
        "low_freq_factor": 1.0,
        "high_freq_factor": 4.0,
        "original_context_length": 8192,
    }
}
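
- To give a rough idea of how the `rope_freq` settings act on the RoPE inverse frequencies, here is a minimal pure-Python sketch; it assumes the wavelength-based smoothing scheme used for Llama 3.1 and a head dimension of 128 (`emb_dim / n_heads`), and `scaled_inv_freq` is a hypothetical name, not a function defined in this notebook

```python
import math


def scaled_inv_freq(head_dim, theta_base, freq_config):
    """Return (original, adjusted) RoPE inverse frequencies as Python lists."""
    # Standard RoPE inverse frequencies: one per pair of dimensions
    inv_freq = [1.0 / theta_base ** (i / head_dim) for i in range(0, head_dim, 2)]

    factor = freq_config["factor"]
    low = freq_config["low_freq_factor"]
    high = freq_config["high_freq_factor"]
    orig_len = freq_config["original_context_length"]

    low_freq_wavelen = orig_len / low    # wavelengths above this are fully rescaled
    high_freq_wavelen = orig_len / high  # wavelengths below this stay unchanged

    adjusted = []
    for f in inv_freq:
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:
            adjusted.append(f)
        elif wavelen > low_freq_wavelen:
            adjusted.append(f / factor)
        else:
            # Interpolate smoothly between the rescaled and original frequency
            smooth = (orig_len / wavelen - low) / (high - low)
            adjusted.append((1 - smooth) * f / factor + smooth * f)
    return inv_freq, adjusted


rope_freq = {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_context_length": 8192,
}
base, adjusted = scaled_inv_freq(head_dim=128, theta_base=500_000.0, freq_config=rope_freq)

print("Ratio for the highest frequency:", adjusted[0] / base[0])
print("Ratio for the lowest frequency:", adjusted[-1] / base[-1])
```

- In this sketch, short-wavelength (high-frequency) components are left untouched, long-wavelength components are divided by `factor`, and the band in between is interpolated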

- Reduce the context length so that the model works fine on a MacBook Air (if you have more RAM, feel free to comment out the lines below):

In [10]:
old_context_length = LLAMA31_CONFIG_8B["context_length"]
LLAMA31_CONFIG_8B["context_length"] = 8192


def rescale_theta(theta_old, context_length_old, context_length_new):
    scaling_factor = context_length_new / context_length_old
    theta_new = theta_old * scaling_factor
    return theta_new

LLAMA31_CONFIG_8B["rope_base"] = rescale_theta(
    LLAMA31_CONFIG_8B["rope_base"],
    old_context_length,
    LLAMA31_CONFIG_8B["context_length"]
)

print("New RoPE theta:", LLAMA31_CONFIG_8B["rope_base"])

New RoPE theta: 31250.0


- As we've seen in the code earlier, the RoPE method uses sinusoidal functions (sine and cosine) to embed positional information directly into the attention mechanism
- In Llama 3.1, via the additional configuration, we introduce additional adjustments to the inverse frequency calculations
- These adjustments influence how different frequency components contribute to the positional embeddings (a detailed explanation is a topic for another time)
- Let's try out the Llama 3.1 model in practice; first, we clear out the old model to free up some GPU memory

In [38]:
# free up memory
del model

gc.collect()  # Run Python garbage collector

if torch.cuda.is_available():
    torch.cuda.empty_cache()

- Next, we download the tokenizer
- Note that since the Llama 3.1 family is distinct from the Llama 3 family, you'd have to go to the [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) repository and acknowledge the license terms for your Hugging Face access token to work for the download
- Tip: For simplicity, we only load the base model below, but there's also an instruction-finetuned version you can use by replacing `"meta-llama/Llama-3.1-8B"` with `"meta-llama/Llama-3.1-8B-Instruct"`

In [39]:
tokenizer_file_path = hf_hub_download(
    repo_id="meta-llama/Llama-3.1-8B",
    filename="original/tokenizer.model",
    local_dir="Llama-3.1-8B"
)

tokenizer = Tokenizer(tokenizer_file_path)

In [40]:
model = Llama3Model(LLAMA31_CONFIG_8B)

total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

Total number of parameters: 8,030,261,248


In [41]:
combined_weights = {}

for i in range(1, 5):
    weights_file = hf_hub_download(
        repo_id="meta-llama/Llama-3.1-8B",
        filename=f"model-0000{i}-of-00004.safetensors",
        local_dir="Llama-3.1-8B"
    )
    current_weights = load_file(weights_file)
    combined_weights.update(current_weights)

load_weights_into_llama(model, LLAMA31_CONFIG_8B, combined_weights)
model.to(device);
del combined_weights  # free up memory

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

In [42]:
torch.manual_seed(123)

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort", tokenizer).to(device),
    max_new_tokens=25,
    context_size=LLAMA31_CONFIG_8B["context_length"],
    top_k=1,
    temperature=0.
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort has been made to trace copyright holders and to obtain their permission for the use of copyright material. The publisher apologizes for any


&nbsp;
# Llama 3.2 1B

- As of this writing, Meta AI's latest models are the Llama 3.2 models announced [here](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/)
- The code for the Llama 3.2 text model is similar to that of Llama 3.1, except that the model has shrunk in size (there are 1B and 3B versions)
- The other efficiency tweak was that they added back weight tying (a concept that was originally used in the GPT-2 architecture); here, they reuse the same weight parameter values in the input (token) embedding layer and output layer
- The small model size of Llama 3.2 1B is quite convenient, since it can even run on many mobile devices
- The architectural differences between Llama 3.1 8B and Llama 3.2 1B are illustrated in the figure below

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/GPT-to-llama/llama31-to-llama32.webp?1" width="700px">

- As we can see based on the figure above, the main difference between the Llama 3.1 8B and Llama 3.2 1B architectures is their respective sizes
- A small additional change is an increased RoPE rescaling factor, which is reflected in the configuration file below

In [43]:
LLAMA31_CONFIG_8B = {
    "vocab_size": 128_256,      # Vocabulary size
    "context_length": 131_072,  # NEW: Larger supported context length
    "emb_dim": 4096,            # Embedding dimension
    "n_heads": 32,              # Number of attention heads
    "n_layers": 32,             # Number of layers
    "hidden_dim": 14_336,       # Size of the intermediate dimension in FeedForward
    "n_kv_groups": 8,           # Key-Value groups for grouped-query attention
    "rope_base": 500_000.0,     # The base in RoPE's "theta"
    "dtype": torch.bfloat16,    # Lower-precision dtype to reduce memory usage
    "rope_freq": {              # NEW: RoPE frequency scaling
        "factor": 8.0,
        "low_freq_factor": 1.0,
        "high_freq_factor": 4.0,
        "original_context_length": 8192,
    }
}


LLAMA32_CONFIG_1B = {
    "vocab_size": 128_256,      # Vocabulary size
    "context_length": 131_072,  # Context length
    "emb_dim": 2048,            # NEW: Half the embedding dimension
    "n_heads": 32,              # Number of attention heads
    "n_layers": 16,             # NEW: Half the number of layers
    "hidden_dim": 8192,         # NEW: Almost half the size of the intermediate dimension in FeedForward
    "n_kv_groups": 8,           # Key-Value groups for grouped-query attention
    "rope_base": 500_000.0,     # The base in RoPE's "theta"
    "dtype": torch.bfloat16,    # Lower-precision dtype to reduce memory usage
    "rope_freq": {              # RoPE frequency scaling
        "factor": 32.0,         # NEW: Adjustment of the rescaling factor
        "low_freq_factor": 1.0,
        "high_freq_factor": 4.0,
        "original_context_length": 8192,
    }
}

- Reduce the context length so that the model works fine on a MacBook Air (if you have more RAM, feel free to comment out the lines below):

In [10]:
old_context_length = LLAMA32_CONFIG_1B["context_length"]
LLAMA32_CONFIG_1B["context_length"] = 8192

LLAMA32_CONFIG_1B["rope_base"] = rescale_theta(
    LLAMA32_CONFIG_1B["rope_base"],
    old_context_length,
    LLAMA32_CONFIG_1B["context_length"]
)

print("New RoPE theta:", LLAMA32_CONFIG_1B["rope_base"])

New RoPE theta: 31250.0


- Below, we can reuse the code from the Llama 3.1 8B section to load the Llama 3.2 1B model
- Again, since the Llama 3.2 family is distinct from the Llama 3.1 family, you'd have to go to the [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) repository and acknowledge the license terms for your Hugging Face access token to work for the download
- Tip: For simplicity, we only load the base model below, but there's also an instruction-finetuned version you can use by replacing `"meta-llama/Llama-3.2-1B"` with `"meta-llama/Llama-3.2-1B-Instruct"`

In [44]:
# free up memory
del model


gc.collect()  # Run Python garbage collector

if torch.cuda.is_available():
    torch.cuda.empty_cache()

In [45]:
tokenizer_file_path = hf_hub_download(
    repo_id="meta-llama/Llama-3.2-1B",
    filename="original/tokenizer.model",
    local_dir="Llama-3.2-1B"
)

tokenizer = Tokenizer(tokenizer_file_path)

In [46]:
model = Llama3Model(LLAMA32_CONFIG_1B)

total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

# Account for weight tying
total_params_normalized = total_params - model.tok_emb.weight.numel()
print(f"\nTotal number of unique parameters: {total_params_normalized:,}")

Total number of parameters: 1,498,482,688

Total number of unique parameters: 1,235,814,400
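
- The parameter counts reported in this notebook can be reproduced by hand from the configuration values; the sketch below assumes bias-free linear layers and one weight vector per normalization layer (consistent with the tensors that `load_weights_into_llama` reads), and `count_params` is a hypothetical helper

```python
def count_params(cfg):
    """Return (total, unique) parameter counts, where unique assumes weight tying."""
    emb, hidden = cfg["emb_dim"], cfg["hidden_dim"]
    head_dim = emb // cfg["n_heads"]
    kv_dim = cfg["n_kv_groups"] * head_dim  # width of the key and value projections

    embedding = cfg["vocab_size"] * emb
    attention = 2 * emb * emb + 2 * emb * kv_dim  # query + output, key + value
    feed_forward = 3 * emb * hidden               # gate, up, and down projections
    norms = 2 * emb                               # two normalization layers per block
    per_layer = attention + feed_forward + norms

    # Token embedding + output head + all blocks + final normalization layer
    total = 2 * embedding + cfg["n_layers"] * per_layer + emb
    return total, total - embedding


cfg_8b = {"vocab_size": 128_256, "emb_dim": 4096, "n_heads": 32,
          "n_layers": 32, "hidden_dim": 14_336, "n_kv_groups": 8}
cfg_1b = {"vocab_size": 128_256, "emb_dim": 2048, "n_heads": 32,
          "n_layers": 16, "hidden_dim": 8192, "n_kv_groups": 8}

total_8b, _ = count_params(cfg_8b)
total_1b, unique_1b = count_params(cfg_1b)

print(f"Llama 3.1 8B total: {total_8b:,}")
print(f"Llama 3.2 1B total: {total_1b:,}")
print(f"Llama 3.2 1B unique: {unique_1b:,}")
print(f"Llama 3.2 1B size in bfloat16: {unique_1b * 2 / 1e9:.2f} GB")
```

- At 2 bytes per parameter in `bfloat16`, the unique parameter count of the 1B model corresponds to roughly 2.47 GB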


In [47]:
weights_file = hf_hub_download(
    repo_id="meta-llama/Llama-3.2-1B",
    filename="model.safetensors",
    local_dir="Llama-3.2-1B"
)
current_weights = load_file(weights_file)

load_weights_into_llama(model, LLAMA32_CONFIG_1B, current_weights)
model.to(device);
del current_weights  # free up memory

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

Model uses weight tying.


In [48]:
print("Weight tying:", torch.equal(model.tok_emb.weight, model.out_head.weight))

Weight tying: True


In [49]:
torch.manual_seed(123)

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort", tokenizer).to(device),
    max_new_tokens=25,
    context_size=LLAMA32_CONFIG_1B["context_length"],
    top_k=1,
    temperature=0.
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort is made to ensure that the information on this website is accurate. However, we cannot guarantee that the information is accurate, complete


&nbsp;
# What's next?

- This notebook concludes the conversion from GPT to Llama 3.2
- If you are interested in a more compact, standalone notebook that contains only the Llama 3.2 code, check out the [standalone-llama32.ipynb](standalone-llama32.ipynb) notebook