<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Chapter 3 Exercise Solutions

In [3]:
from importlib.metadata import version

import torch
print("torch version:", version("torch"))

torch version: 2.4.0


# Exercise 3.1

In [4]:
inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

d_in, d_out = 3, 2

In [6]:
import torch.nn as nn


class SelfAttention_v1(nn.Module):
    def __init__(self, d_in: int, d_out: int) -> None:
        super().__init__()
        self.d_out = d_out
        self.W_q = nn.Parameter(torch.randn(d_in, d_out))
        self.W_k = nn.Parameter(torch.randn(d_in, d_out))
        self.W_v = nn.Parameter(torch.randn(d_in, d_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = x @ self.W_q
        k = x @ self.W_k
        v = x @ self.W_v

        attn_scores = q @ k.T
        attn_weights = torch.softmax(attn_scores / k.shape[-1] ** 0.5, dim=-1)
        context_vector = attn_weights @ v
        return context_vector


torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)

In [12]:
class SelfAttention_v2(nn.Module):
    def __init__(self, d_in: int, d_out: int) -> None:
        super().__init__()
        self.d_out = d_out
        self.W_q = nn.Linear(d_in, d_out, bias=False)
        self.W_k = nn.Linear(d_in, d_out, bias=False)
        self.W_v = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.W_q(x)
        k = self.W_k(x)
        v = self.W_v(x)
        
        attn_scores = q @ k.T
        attn_weights = torch.softmax(attn_scores / k.shape[-1] ** 0.5, dim=-1)
        context_vector = attn_weights @ v
        return context_vector


torch.manual_seed(123)
sa_v2 = SelfAttention_v2(d_in, d_out)

In [13]:
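# nn.Linear stores its weight as (d_out, d_in), so transpose it to match
# the (d_in, d_out) layout that SelfAttention_v1 multiplies by directly
sa_v1.W_q = torch.nn.Parameter(sa_v2.W_q.weight.T)
sa_v1.W_k = torch.nn.Parameter(sa_v2.W_k.weight.T)
sa_v1.W_v = torch.nn.Parameter(sa_v2.W_v.weight.T)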

In [14]:
sa_v1(inputs)

tensor([[-0.5337, -0.1051],
        [-0.5323, -0.1080],
        [-0.5323, -0.1079],
        [-0.5297, -0.1076],
        [-0.5311, -0.1066],
        [-0.5299, -0.1081]], grad_fn=<MmBackward0>)

In [15]:
sa_v2(inputs)

tensor([[-0.5337, -0.1051],
        [-0.5323, -0.1080],
        [-0.5323, -0.1079],
        [-0.5297, -0.1076],
        [-0.5311, -0.1066],
        [-0.5299, -0.1081]], grad_fn=<MmBackward0>)
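
The two outputs match because `nn.Linear` stores its weight with shape `(d_out, d_in)` and computes `x @ weight.T`, whereas `SelfAttention_v1` multiplies by a `(d_in, d_out)` matrix directly. The following is a minimal sketch that checks this on a single layer; the names `linear`, `x`, and `W` are illustrative and not part of the chapter code.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

d_in, d_out = 3, 2
x = torch.rand(6, d_in)  # six illustrative input vectors

linear = nn.Linear(d_in, d_out, bias=False)
print(linear.weight.shape)  # torch.Size([2, 3]), i.e. (d_out, d_in)

# A plain (d_in, d_out) matrix, in the layout SelfAttention_v1 expects
W = nn.Parameter(linear.weight.T)

# The layer call and the explicit matrix product should agree
same = torch.allclose(linear(x), x @ W)
print(same)
```

If the transpose were omitted, the assignment would produce a `(d_out, d_in)` matrix and `x @ W` would fail with a shape mismatch for `d_in != d_out`.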

# Exercise 3.2

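With multi-head attention, the outputs of the individual heads are concatenated, so the final output dimension is `num_heads * d_out`. For example, with 2 heads, `d_out` should be set to 1 to obtain a 2-dimensional output.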

```python
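# MultiHeadAttentionWrapper comes from the main chapter 3 code
batch = torch.stack((inputs, inputs), dim=0)  # two copies of the 6-token input
context_length = batch.shape[1]  # number of tokens per sequence (6)

torch.manual_seed(123)

d_out = 1  # per-head output size; 2 heads * 1 = 2 output dimensions
mha = MultiHeadAttentionWrapper(d_in, d_out, context_length, 0.0, num_heads=2)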

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
```

```
tensor([[[-9.1476e-02,  3.4164e-02],
         [-2.6796e-01, -1.3427e-03],
         [-4.8421e-01, -4.8909e-02],
         [-6.4808e-01, -1.0625e-01],
         [-8.8380e-01, -1.7140e-01],
         [-1.4744e+00, -3.4327e-01]],

        [[-9.1476e-02,  3.4164e-02],
         [-2.6796e-01, -1.3427e-03],
         [-4.8421e-01, -4.8909e-02],
         [-6.4808e-01, -1.0625e-01],
         [-8.8380e-01, -1.7140e-01],
         [-1.4744e+00, -3.4327e-01]]], grad_fn=<CatBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])
```
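
To see where the `num_heads * d_out` output size comes from, here is a simplified, self-contained sketch of head concatenation. `CausalHead` and `HeadConcat` are hypothetical stand-ins written for this illustration, not the chapter's `MultiHeadAttentionWrapper`; dropout is omitted and the input is random, so the values will differ from the output above while the shape should be the same.

```python
import torch
import torch.nn as nn


class CausalHead(nn.Module):
    """One causal self-attention head (simplified, hypothetical stand-in)."""

    def __init__(self, d_in, d_out, context_length):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_out, bias=False)
        self.W_k = nn.Linear(d_in, d_out, bias=False)
        self.W_v = nn.Linear(d_in, d_out, bias=False)
        # True above the diagonal marks future tokens that must be hidden
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
        self.register_buffer("mask", mask.bool())

    def forward(self, x):
        num_tokens = x.shape[1]
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.transpose(1, 2)
        scores = scores.masked_fill(
            self.mask[:num_tokens, :num_tokens], float("-inf")
        )
        weights = torch.softmax(scores / k.shape[-1] ** 0.5, dim=-1)
        return weights @ v  # shape: (batch, num_tokens, d_out)


class HeadConcat(nn.Module):
    """Runs several heads and concatenates their outputs on the last axis."""

    def __init__(self, d_in, d_out, context_length, num_heads):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalHead(d_in, d_out, context_length) for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)


x = torch.rand(2, 6, 3)  # (batch, num_tokens, d_in)
out = HeadConcat(d_in=3, d_out=1, context_length=6, num_heads=2)(x)
print(out.shape)  # torch.Size([2, 6, 2])
```

Each head contributes `d_out = 1` column, and `torch.cat(..., dim=-1)` stacks the two columns side by side, which is also why the output above reports `grad_fn=<CatBackward0>`.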

# Exercise 3.3

```python
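# Settings matching the smallest GPT-2 model
context_length = 1024
d_in, d_out = 768, 768
num_heads = 12

# MultiHeadAttention is the class from the main chapter 3 code
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads)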
```

The number of trainable parameters can be counted with the following function:

```python
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

count_parameters(mha)
```

```
2360064  # (2.36 M)
```

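The GPT-2 model has 117M parameters in total, but a single multi-head attention module contains only 2.36M of them, so most of the model's parameters are not located in the multi-head attention module itself.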
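
The 2,360,064 figure can be reproduced by hand. The sketch below assumes the module consists of three bias-free projections (query, key, value) plus an output projection with a bias, and that the causal mask is a buffer rather than a trainable parameter; the 12-module estimate further assumes one attention module in each of the 12 transformer blocks of the smallest GPT-2 configuration. Variable names are illustrative.

```python
d_in = d_out = 768

# Assumed: query, key, and value projections without bias terms
qkv_params = 3 * d_in * d_out

# Assumed: output projection with a weight matrix and a bias vector
out_proj_params = d_out * d_out + d_out

mha_params = qkv_params + out_proj_params
print(mha_params)  # 2360064

gpt2_total = 117_000_000  # total quoted in the text above

# Share of one attention module in the whole model
print(f"{mha_params / gpt2_total:.1%}")  # 2.0%

# Rough share of all attention modules, assuming 12 transformer blocks
all_mha_params = 12 * mha_params
print(all_mha_params)  # 28320768
print(f"{all_mha_params / gpt2_total:.1%}")  # 24.2%
```

Under these assumptions, even all attention modules together make up roughly a quarter of the total, which is consistent with the observation that most parameters sit elsewhere in the model.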