<a href="https://colab.research.google.com/github/bobothebest/LLMs-from-scratch/blob/main/ch03/03_understanding-buffers/understanding-buffers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Understanding PyTorch Buffers

In essence, PyTorch buffers are tensor attributes associated with a PyTorch module or model. They are similar to parameters, but unlike parameters, buffers receive no gradients and are not updated by the optimizer during training.

Buffers are particularly useful in GPU computations because they are transferred between devices (for example, from CPU to GPU) alongside the model's parameters. Buffers do not require gradient computation, but they still need to be on the same device as the parameters and inputs so that all computations run correctly.
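The distinction between parameters and buffers can be illustrated with a minimal sketch. The `Scaler` module and its attribute names are hypothetical and not part of the book's code; the example only shows how PyTorch reports the two kinds of tensors.

```python
import torch
import torch.nn as nn

class Scaler(nn.Module):  # hypothetical toy module, not from the book
    def __init__(self):
        super().__init__()
        # Trainable: listed by named_parameters() and tracked for gradients
        self.weight = nn.Parameter(torch.ones(3))
        # Not trainable: listed by named_buffers() and not tracked for gradients
        self.register_buffer("offset", torch.zeros(3))

    def forward(self, x):
        return x * self.weight + self.offset

scaler = Scaler()
param_names = [name for name, _ in scaler.named_parameters()]
buffer_names = [name for name, _ in scaler.named_buffers()]

print(param_names)   # ['weight']
print(buffer_names)  # ['offset']
print(scaler.weight.requires_grad, scaler.offset.requires_grad)  # True False
```

Because an optimizer is usually constructed from `model.parameters()`, a buffer such as `offset` is never passed to it.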

In chapter 3, we use PyTorch buffers via `self.register_buffer`, which is only briefly explained in the book. Since the concept and purpose are not immediately clear, this code notebook offers a longer explanation with a hands-on example.


## An example without buffers

Suppose we have the following code, which is based on code from chapter 3. This version has been modified to exclude buffers. It implements the causal self-attention mechanism used in LLMs:


In [1]:
import torch
import torch.nn as nn

class CausalAttentionWithoutBuffers(nn.Module):

    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        self.mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec

We can initialize and run the module on some example data as follows:

In [2]:
torch.manual_seed(123)

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

batch = torch.stack((inputs, inputs), dim=0)
context_length = batch.shape[1]
d_in = inputs.shape[1]
d_out = 2

ca_without_buffer = CausalAttentionWithoutBuffers(d_in, d_out, context_length, 0.0)

with torch.no_grad():
    context_vecs = ca_without_buffer(batch)

print(context_vecs)

tensor([[[-0.4519,  0.2216],
         [-0.5874,  0.0058],
         [-0.6300, -0.0632],
         [-0.5675, -0.0843],
         [-0.5526, -0.0981],
         [-0.5299, -0.1081]],

        [[-0.4519,  0.2216],
         [-0.5874,  0.0058],
         [-0.6300, -0.0632],
         [-0.5675, -0.0843],
         [-0.5526, -0.0981],
         [-0.5299, -0.1081]]])


So far, everything has worked fine.

However, when training LLMs, we typically use GPUs to accelerate the process. Therefore, let's transfer the `CausalAttentionWithoutBuffers` module onto a GPU device.


Please note that this operation requires the code to be run in an environment equipped with GPUs.


In [3]:
print("Machine has GPU:", torch.cuda.is_available())

batch = batch.to("cuda")
ca_without_buffer.to("cuda");

Machine has GPU: True


Now, let's run the code again:

In [4]:
with torch.no_grad():
    context_vecs = ca_without_buffer(batch)

print(context_vecs)

RuntimeError: expected self and mask to be on the same device, but got mask on cpu and self on cuda:0

Running the code resulted in an error. What happened? According to the error message, the `masked_fill_` operation received the attention scores on the GPU (`cuda:0`) but the mask on the CPU. But we moved the module to the GPU!?

Let's double-check the device locations of some of the tensors:



In [5]:
print("W_query.device:", ca_without_buffer.W_query.weight.device)
print("mask.device:", ca_without_buffer.mask.device)

W_query.device: cuda:0
mask.device: cpu


In [6]:
type(ca_without_buffer.mask)

torch.Tensor

In [7]:
type(ca_without_buffer.W_key.weight)

torch.nn.parameter.Parameter

As we can see, the `mask` was not moved onto the GPU. That's because it is a plain tensor attribute rather than a registered PyTorch parameter like the weights (e.g., `W_query.weight`), so calling `.to("cuda")` on the module does not move it.

This means we have to manually move it to the GPU via `.to("cuda")`:

In [None]:
ca_without_buffer.mask = ca_without_buffer.mask.to("cuda")
print("mask.device:", ca_without_buffer.mask.device)

mask.device: cuda:0


Let's try our code again:

In [None]:
with torch.no_grad():
    context_vecs = ca_without_buffer(batch)

print(context_vecs)

tensor([[[-0.4519,  0.2216],
         [-0.5874,  0.0058],
         [-0.6300, -0.0632],
         [-0.5675, -0.0843],
         [-0.5526, -0.0981],
         [-0.5299, -0.1081]],

        [[-0.4519,  0.2216],
         [-0.5874,  0.0058],
         [-0.6300, -0.0632],
         [-0.5675, -0.0843],
         [-0.5526, -0.0981],
         [-0.5299, -0.1081]]], device='cuda:0')


This time, it worked!


However, remembering to move individual tensors to the GPU can be tedious. As we will see in the next section, it's easier to use `register_buffer` to register the `mask` as a buffer.


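PyTorch treats parameters as values to be learned, so they are moved to the GPU together with the module. The `mask`, by contrast, is not something we want to learn, so we want it to be a buffer instead.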
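The following sketch shows why a plain tensor attribute is left behind. So that it runs without a GPU, it uses a dtype conversion via `.to(torch.float64)` as a stand-in for a device transfer; both go through the same module-level `.to()` call, which only touches registered parameters and buffers. The `Demo` module and its attribute names are hypothetical.

```python
import torch
import torch.nn as nn

class Demo(nn.Module):  # hypothetical module for illustration only
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)
        # Plain attribute: the module does not track this tensor
        self.plain_mask = torch.ones(2, 2)
        # Registered buffer: the module tracks this tensor
        self.register_buffer("buffer_mask", torch.ones(2, 2))

demo = Demo()
demo.to(torch.float64)  # stand-in for demo.to("cuda") on a machine without a GPU

print(demo.linear.weight.dtype)  # torch.float64
print(demo.buffer_mask.dtype)    # torch.float64
print(demo.plain_mask.dtype)     # torch.float32 (left untouched)
```

The parameter and the registered buffer follow the module, while the plain attribute keeps its original dtype, just as the unregistered `mask` stayed on the CPU.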

## An example with buffers

Let's now modify the causal attention class to register the causal `mask` as a buffer:

In [None]:
import torch
import torch.nn as nn

class CausalAttentionWithBuffer(nn.Module):

    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # Old:
        # self.mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)

        # New:
        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec

Now, conveniently, if we move the module to the GPU, the mask will be located on the GPU as well:


In [None]:
ca_with_buffer = CausalAttentionWithBuffer(d_in, d_out, context_length, 0.0)
ca_with_buffer.to("cuda")

print("W_query.device:", ca_with_buffer.W_query.weight.device)
print("mask.device:", ca_with_buffer.mask.device)

W_query.device: cuda:0
mask.device: cuda:0


In [None]:
with torch.no_grad():
    context_vecs = ca_with_buffer(batch)

print(context_vecs)

tensor([[[0.4772, 0.1063],
         [0.5891, 0.3257],
         [0.6202, 0.3860],
         [0.5478, 0.3589],
         [0.5321, 0.3428],
         [0.5077, 0.3493]],

        [[0.4772, 0.1063],
         [0.5891, 0.3257],
         [0.6202, 0.3860],
         [0.5478, 0.3589],
         [0.5321, 0.3428],
         [0.5077, 0.3493]]], device='cuda:0')


As we can see above, registering a tensor as a buffer can make our lives a lot easier: we don't have to remember to manually move tensors to a target device like a GPU. Note that the output values differ from the earlier run only because `ca_with_buffer` is a newly initialized module with different random weights, not because of the buffer.
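Registering the mask as a buffer does not change what the module computes. As a rough check, the sketch below builds a condensed attention module twice from the same random seed, once with a plain mask attribute and once with a registered buffer, and compares the outputs on the CPU. `TinyCausalAttention` and its `use_buffer` flag are hypothetical simplifications of the classes above (dropout omitted).

```python
import torch
import torch.nn as nn

class TinyCausalAttention(nn.Module):  # hypothetical condensed variant
    def __init__(self, d_in, d_out, context_length, use_buffer):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
        if use_buffer:
            self.register_buffer("mask", mask)  # tracked by the module
        else:
            self.mask = mask                    # plain attribute

    def forward(self, x):
        num_tokens = x.shape[1]
        queries, keys, values = self.W_query(x), self.W_key(x), self.W_value(x)
        attn_scores = queries @ keys.transpose(1, 2)
        attn_scores = attn_scores.masked_fill(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        return attn_weights @ values

torch.manual_seed(123)
x = torch.rand(2, 6, 3)  # (batch, tokens, d_in)

# Re-seed before each construction so both modules get identical weights
torch.manual_seed(123)
plain = TinyCausalAttention(3, 2, 6, use_buffer=False)
torch.manual_seed(123)
buffered = TinyCausalAttention(3, 2, 6, use_buffer=True)

with torch.no_grad():
    same = torch.allclose(plain(x), buffered(x))

print(same)  # True
```

With identical weights, both variants produce the same context vectors; the buffer only affects how the mask is tracked, moved, and saved.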

## Buffers and `state_dict`

- Another advantage of PyTorch buffers over regular tensors is that they are included in a model's `state_dict`.
- For example, consider the `state_dict` of the causal attention object without buffers:

In [None]:
ca_without_buffer.state_dict()

OrderedDict([('W_query.weight',
              tensor([[-0.2354,  0.0191, -0.2867],
                      [ 0.2177, -0.4919,  0.4232]], device='cuda:0')),
             ('W_key.weight',
              tensor([[-0.4196, -0.4590, -0.3648],
                      [ 0.2615, -0.2133,  0.2161]], device='cuda:0')),
             ('W_value.weight',
              tensor([[-0.4900, -0.3503, -0.2120],
                      [-0.1135, -0.4404,  0.3780]], device='cuda:0'))])

- The mask is not included in the `state_dict` above.
- However, the mask *is* included in the `state_dict` below, thanks to registering it as a buffer:

In [None]:
ca_with_buffer.state_dict()

OrderedDict([('mask',
              tensor([[0., 1., 1., 1., 1., 1.],
                      [0., 0., 1., 1., 1., 1.],
                      [0., 0., 0., 1., 1., 1.],
                      [0., 0., 0., 0., 1., 1.],
                      [0., 0., 0., 0., 0., 1.],
                      [0., 0., 0., 0., 0., 0.]], device='cuda:0')),
             ('W_query.weight',
              tensor([[-0.1362,  0.1853,  0.4083],
                      [ 0.1076,  0.1579,  0.5573]], device='cuda:0')),
             ('W_key.weight',
              tensor([[-0.2604,  0.1829, -0.2569],
                      [ 0.4126,  0.4611, -0.5323]], device='cuda:0')),
             ('W_value.weight',
              tensor([[ 0.4929,  0.2757,  0.2516],
                      [ 0.2377,  0.4800, -0.0762]], device='cuda:0'))])

- A `state_dict` is useful, for example, when saving and loading trained PyTorch models.
- In this particular case, saving and loading the `mask` is not especially useful because it remains unchanged during training. So, for demonstration purposes, let's assume it was modified so that all `1`s were changed to `2`s:

In [None]:
ca_with_buffer.mask[ca_with_buffer.mask == 1.] = 2.
ca_with_buffer.mask

tensor([[0., 2., 2., 2., 2., 2.],
        [0., 0., 2., 2., 2., 2.],
        [0., 0., 0., 2., 2., 2.],
        [0., 0., 0., 0., 2., 2.],
        [0., 0., 0., 0., 0., 2.],
        [0., 0., 0., 0., 0., 0.]], device='cuda:0')

- Then, if we save and load the model, we can see that the mask is restored with the modified values:

In [None]:
torch.save(ca_with_buffer.state_dict(), "model.pth")

new_ca_with_buffer = CausalAttentionWithBuffer(d_in, d_out, context_length, 0.0)
new_ca_with_buffer.load_state_dict(torch.load("model.pth"))

new_ca_with_buffer.mask

tensor([[0., 2., 2., 2., 2., 2.],
        [0., 0., 2., 2., 2., 2.],
        [0., 0., 0., 2., 2., 2.],
        [0., 0., 0., 0., 2., 2.],
        [0., 0., 0., 0., 0., 2.],
        [0., 0., 0., 0., 0., 0.]])

- Without buffers, the modified mask is not restored; the newly created module keeps its freshly initialized mask:

In [None]:
ca_without_buffer.mask[ca_without_buffer.mask == 1.] = 2.

torch.save(ca_without_buffer.state_dict(), "model.pth")

new_ca_without_buffer = CausalAttentionWithoutBuffers(d_in, d_out, context_length, 0.0)
new_ca_without_buffer.load_state_dict(torch.load("model.pth"))

new_ca_without_buffer.mask

tensor([[0., 1., 1., 1., 1., 1.],
        [0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 1., 1., 1.],
        [0., 0., 0., 0., 1., 1.],
        [0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0., 0.]])
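Since a constant mask rarely needs to be saved, it may be worth knowing that `register_buffer` accepts a `persistent=False` argument, which keeps the buffer out of the `state_dict` while the module still tracks it. The sketch below is a CPU-only illustration under that assumption; `MaskHolder` and its buffer names are hypothetical, and an in-memory `io.BytesIO` object stands in for the `model.pth` file.

```python
import io
import torch
import torch.nn as nn

class MaskHolder(nn.Module):  # hypothetical module for illustration only
    def __init__(self, context_length):
        super().__init__()
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
        # Persistent buffer (the default): saved in the state_dict
        self.register_buffer("saved_mask", mask.clone())
        # Non-persistent buffer: tracked by the module but not saved
        self.register_buffer("unsaved_mask", mask.clone(), persistent=False)

holder = MaskHolder(3)
holder.saved_mask[holder.saved_mask == 1.] = 2.
holder.unsaved_mask[holder.unsaved_mask == 1.] = 2.

keys = list(holder.state_dict().keys())
print(keys)  # ['saved_mask']

# Save to and load from an in-memory file instead of "model.pth"
file_like = io.BytesIO()
torch.save(holder.state_dict(), file_like)
file_like.seek(0)

restored = MaskHolder(3)
restored.load_state_dict(torch.load(file_like))

print(restored.saved_mask.max().item())    # 2.0 (modified value restored)
print(restored.unsaved_mask.max().item())  # 1.0 (freshly initialized value)
```

Only the persistent buffer carries its modified values through the save-and-load round trip; the non-persistent one behaves like the unregistered mask in this respect.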