<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Chapter 3: Coding Attention Mechanisms

Packages that are being used in this notebook:

In [1]:
from importlib.metadata import version

print("torch version:", version("torch"))

torch version: 2.4.0


- This chapter covers attention mechanisms, the engine of LLMs:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/01.webp?123" width="500px">

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/02.webp" width="600px">

## 3.1 The problem with modeling long sequences

- No code in this section
- Translating a text word by word isn't feasible due to the differences in grammatical structures between the source and target languages:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/03.webp" width="400px">

- Prior to the introduction of transformer models, encoder-decoder RNNs were commonly used for machine translation tasks
- In this setup, the encoder processes a sequence of tokens from the source language, using a hidden state (a kind of intermediate layer within the neural network) to generate a condensed representation of the entire input sequence:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/04.webp" width="500px">

## 3.2 Capturing data dependencies with attention mechanisms

- No code in this section
- Through an attention mechanism, the text-generating decoder segment of the network is capable of selectively accessing all input tokens, implying that certain input tokens hold more significance than others in the generation of a specific output token:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/05.webp" width="500px">

- Self-attention in transformers is a technique designed to enhance input representations by enabling each position in a sequence to engage with and determine the relevance of every other position within the same sequence

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/06.webp" width="300px">

## 3.3 Attending to different parts of the input with self-attention

### 3.3.1 A simple self-attention mechanism without trainable weights

- This section explains a very simplified variant of self-attention, which does not contain any trainable weights
- This is purely for illustration purposes and NOT the attention mechanism that is used in transformers
- Section 3.3.2 generalizes this simple attention mechanism to all input tokens, and section 3.4 then extends it to the real self-attention mechanism with trainable weights
- Suppose we are given an input sequence $x^{(1)}$ to $x^{(T)}$
  - The input is a text (for example, a sentence like "Your journey starts with one step") that has already been converted into token embeddings as described in chapter 2
  - For instance, $x^{(1)}$ is a d-dimensional vector representing the word "Your", and so forth
- **Goal:** compute context vectors $z^{(i)}$ for each input sequence element $x^{(i)}$ in $x^{(1)}$ to $x^{(T)}$ (where $z$ and $x$ have the same dimension)
    - A context vector $z^{(i)}$ is a weighted sum over the inputs $x^{(1)}$ to $x^{(T)}$
    - The context vector is "context"-specific to a certain input
      - Instead of $x^{(i)}$ as a placeholder for an arbitrary input token, let's consider the second input, $x^{(2)}$
      - And to continue with a concrete example, instead of the placeholder $z^{(i)}$, we consider the second output context vector, $z^{(2)}$
      - The second context vector, $z^{(2)}$, is a weighted sum over all inputs $x^{(1)}$ to $x^{(T)}$ weighted with respect to the second input element, $x^{(2)}$
      - The attention weights are the weights that determine how much each of the input elements contributes to the weighted sum when computing $z^{(2)}$
      - In short, think of $z^{(2)}$ as a modified version of $x^{(2)}$ that also incorporates information about all other input elements that are relevant to a given task at hand

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/07.webp" width="400px">

- (Please note that the numbers in this figure are truncated to one digit after the decimal point to reduce visual clutter; similarly, other figures may also contain truncated values)

- By convention, the unnormalized attention weights are referred to as **"attention scores"** whereas the normalized attention scores, which sum to 1, are referred to as **"attention weights"**


- The code below walks through the figure above step by step

<br>

- **Step 1:** compute unnormalized attention scores $\omega$
- Using the second input token as the query, that is, $q^{(2)} = x^{(2)}$, we compute the unnormalized attention scores via dot products:
    - $\omega_{21} = x^{(1)} q^{(2)\top}$
    - $\omega_{22} = x^{(2)} q^{(2)\top}$
    - $\omega_{23} = x^{(3)} q^{(2)\top}$
    - ...
    - $\omega_{2T} = x^{(T)} q^{(2)\top}$
- Above, $\omega$ is the Greek letter "omega" used to symbolize the unnormalized attention scores
    - The subscript "21" in $\omega_{21}$ means that input sequence element 2 was used as a query against input sequence element 1

- Suppose we have the following input sentence that is already embedded in 3-dimensional vectors as described in chapter 2 (we use a very small embedding dimension here for illustration purposes, so that it fits onto the page without line breaks):

In [2]:
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

- (In this book, we follow the common machine learning and deep learning convention where training examples are represented as rows and feature values as columns; in the case of the tensor shown above, each row represents a word, and each column represents an embedding dimension)

- The primary objective of this section is to demonstrate how the context vector $z^{(2)}$
  is calculated using the second input element, $x^{(2)}$, as a query

- The figure depicts the initial step in this process, which involves calculating the attention scores $\omega$ between $x^{(2)}$
  and all input elements through a dot product operation

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/08.webp" width="400px">

- We use input sequence element 2, $x^{(2)}$, as an example to compute context vector $z^{(2)}$; later in this section, we will generalize this to compute all context vectors.
- The first step is to compute the unnormalized attention scores by computing the dot product between the query $x^{(2)}$ and all input tokens:

In [3]:
query = inputs[1]  # 2nd input token is the query

attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query) # dot product (transpose not necessary here since they are 1-dim vectors)

print(attn_scores_2)

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])


- Side note: a dot product is essentially a shorthand for multiplying two vectors element-wise and summing the resulting products:

In [4]:
res = 0.

for idx, element in enumerate(inputs[0]):
    res += inputs[0][idx] * query[idx]

print(res)
print(torch.dot(inputs[0], query))

tensor(0.9544)
tensor(0.9544)


- **Step 2:** normalize the unnormalized attention scores ("omegas", $\omega$) so that they sum up to 1
- Here is a simple way to normalize the unnormalized attention scores to sum up to 1 (a convention, useful for interpretation, and important for training stability):

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/09.webp" width="500px">

In [5]:
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()

print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())

Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)


- However, in practice, using the softmax function for normalization, which is better at handling extreme values and has more desirable gradient properties during training, is common and recommended.
- Here's a naive implementation of a softmax function for scaling, which also normalizes the vector elements such that they sum up to 1:

In [6]:
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)

print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)


- The naive implementation above can suffer from numerical instability for large or small input values due to overflow and underflow
- Hence, in practice, it's recommended to use the PyTorch implementation of softmax instead, which has been highly optimized for performance:

In [7]:
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)

print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
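
- To make the overflow issue concrete, here is a minimal sketch that compares the naive softmax with a max-subtraction variant on deliberately large scores. The function `softmax_stable` and the example values are illustrative assumptions and not code from the book; subtracting the maximum is a common stabilization trick, and the sketch makes no claim about how `torch.softmax` is implemented internally:

```python
import torch

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def softmax_stable(x):
    # Shifting by the maximum leaves the result mathematically unchanged
    # but keeps the largest exponent at exp(0) = 1
    shifted = x - x.max()
    return torch.exp(shifted) / torch.exp(shifted).sum(dim=0)

# Hypothetical, deliberately large scores (float32)
large_scores = torch.tensor([1000.0, 1001.0, 1002.0])

naive = softmax_naive(large_scores)      # exp(1000.) overflows to inf, and inf / inf gives nan
stable = softmax_stable(large_scores)
reference = torch.softmax(large_scores, dim=0)

print("Naive:    ", naive)
print("Stable:   ", stable)
print("PyTorch:  ", reference)
```

- With these inputs, the naive version produces `nan` entries, whereas the shifted version agrees with `torch.softmax`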


- **Step 3**: compute the context vector $z^{(2)}$ by multiplying the embedded input tokens, $x^{(i)}$, with the attention weights and summing the resulting vectors:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/10.webp" width="500px">

In [8]:
query = inputs[1] # 2nd input token is the query

context_vec_2 = torch.zeros(query.shape)
for i,x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i]*x_i

print(context_vec_2)

tensor([0.4419, 0.6515, 0.5683])


### 3.3.2 Computing attention weights for all input tokens

#### Generalize to all input sequence tokens:

- Above, we computed the attention weights and context vector for input 2 (as illustrated in the highlighted row in the figure below)
- Next, we are generalizing this computation to compute all attention weights and context vectors

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/11.webp" width="400px">

- (Please note that the numbers in this figure are truncated to two digits after the decimal point to reduce visual clutter; the values in each row should add up to 1.0 or 100%; similarly, digits in other figures are truncated)

- In self-attention, the process starts with the calculation of attention scores, which are subsequently normalized to derive attention weights that total 1
- These attention weights are then utilized to generate the context vectors through a weighted summation of the inputs

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/12.webp" width="400px">

- Apply previous **step 1** to all pairwise elements to compute the unnormalized attention score matrix:

In [9]:
attn_scores = torch.empty(6, 6)

for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)

print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


- We can achieve the same as above more efficiently via matrix multiplication:

In [10]:
attn_scores = inputs @ inputs.T
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


- Similar to **step 2** previously, we normalize each row so that the values in each row sum to 1:

In [11]:
attn_weights = torch.softmax(attn_scores, dim=-1)
print(attn_weights)

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


- Quick verification that the values in each row indeed sum to 1:

In [12]:
row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
print("Row 2 sum:", row_2_sum)

print("All row sums:", attn_weights.sum(dim=-1))

Row 2 sum: 1.0
All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])


- Apply previous **step 3** to compute all context vectors:

In [13]:
all_context_vecs = attn_weights @ inputs
print(all_context_vecs)

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])


- As a sanity check, the previously computed context vector $z^{(2)} = [0.4419, 0.6515, 0.5683]$ can be found in the 2nd row above:

In [14]:
print("Previous 2nd context vector:", context_vec_2)

Previous 2nd context vector: tensor([0.4419, 0.6515, 0.5683])
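
- As a compact recap, the three steps of this section can be folded into one small function. This is only a sketch: the name `simple_self_attention` is a hypothetical helper introduced here, and the mechanism is still the simplified, weight-free variant rather than the self-attention used in transformers:

```python
import torch

def simple_self_attention(x):
    """Weight-free self-attention; x has shape (num_tokens, embedding_dim)."""
    attn_scores = x @ x.T                               # step 1: pairwise dot products
    attn_weights = torch.softmax(attn_scores, dim=-1)   # step 2: normalize each row
    return attn_weights @ x                             # step 3: weighted sums of the inputs

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

context = simple_self_attention(inputs)
print(context.shape)
print(context[1])  # context vector for the 2nd input token
```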


## 3.4 Implementing self-attention with trainable weights

- A conceptual framework illustrating how the self-attention mechanism developed in this section integrates into the overall narrative and structure of this book and chapter

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/13.webp" width="400px">

### 3.4.1 Computing the attention weights step by step

- In this section, we are implementing the self-attention mechanism that is used in the original transformer architecture, the GPT models, and most other popular LLMs
- This self-attention mechanism is also called "scaled dot-product attention"
- The overall idea is similar to before:
  - We want to compute context vectors as weighted sums over the input vectors specific to a certain input element
  - For the above, we need attention weights
- As you will see, there are only slight differences compared to the basic attention mechanism introduced earlier:
  - The most notable difference is the introduction of weight matrices that are updated during model training
  - These trainable weight matrices are crucial so that the model (specifically, the attention module inside the model) can learn to produce "good" context vectors

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/14.webp" width="600px">

- Implementing the self-attention mechanism step by step, we will start by introducing the three trainable weight matrices $W_q$, $W_k$, and $W_v$
- These three matrices are used to project the embedded input tokens, $x^{(i)}$, into query, key, and value vectors via matrix multiplication:

  - Query vector: $q^{(i)} = x^{(i)}\,W_q $
  - Key vector: $k^{(i)} = x^{(i)}\,W_k $
  - Value vector: $v^{(i)} = x^{(i)}\,W_v $


- The embedding dimensions of the input $x$ and the query vector $q$ can be the same or different, depending on the model's design and specific implementation
- In GPT models, the input and output dimensions are usually the same, but for illustration purposes, to better follow the computation, we choose different input and output dimensions here:

In [15]:
x_2 = inputs[1] # second input element
d_in = inputs.shape[1] # the input embedding size, d=3
d_out = 2 # the output embedding size, d=2

- Below, we initialize the three weight matrices; note that we are setting `requires_grad=False` to reduce clutter in the outputs for illustration purposes, but if we were to use the weight matrices for model training, we would set `requires_grad=True` to update these matrices during model training

In [16]:
torch.manual_seed(123)

W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

- Next, we compute the query, key, and value vectors:

In [17]:
query_2 = x_2 @ W_query # _2 because it's with respect to the 2nd input element
key_2 = x_2 @ W_key 
value_2 = x_2 @ W_value

print(query_2)

tensor([0.4306, 1.4551])


- As we can see below, we successfully projected the 6 input tokens from a 3D onto a 2D embedding space:

In [18]:
keys = inputs @ W_key 
values = inputs @ W_value

print("keys.shape:", keys.shape)
print("values.shape:", values.shape)

keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])


- In the next step, **step 2**, we compute the unnormalized attention scores by computing the dot product between the query and each key vector:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/15.webp" width="600px">

In [19]:
keys_2 = keys[1] # Python starts index at 0
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)

tensor(1.8524)


- Since we have 6 inputs, we have 6 attention scores for the given query vector:

In [20]:
attn_scores_2 = query_2 @ keys.T # All attention scores for given query
print(attn_scores_2)

tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/16.webp" width="600px">

- Next, in **step 3**, we compute the attention weights (normalized attention scores that sum up to 1) using the softmax function we used earlier
- The difference to earlier is that we now scale the attention scores by dividing them by the square root of the embedding dimension of the keys, $\sqrt{d_k}$ (i.e., `d_k**0.5`):

In [21]:
d_k = keys.shape[1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)

tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])
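
- One common motivation for the $\sqrt{d_k}$ scaling is that the variance of a dot product grows with the vector dimension, and large-magnitude scores push softmax toward a nearly one-hot output. The sketch below illustrates this under stated assumptions: the queries and keys are random standard-normal placeholders, and the seed, sample count, and dimensions are arbitrary choices for illustration:

```python
import torch

torch.manual_seed(0)
num_samples = 10_000
raw_var, scaled_var = {}, {}

for d_k in (2, 64, 512):
    # Placeholder queries and keys with unit-variance entries
    q = torch.randn(num_samples, d_k)
    k = torch.randn(num_samples, d_k)
    dots = (q * k).sum(dim=-1)              # one dot product per sample
    raw_var[d_k] = dots.var().item()        # grows roughly like d_k
    scaled_var[d_k] = (dots / d_k**0.5).var().item()  # stays close to 1
    print(f"d_k={d_k:4d}  var(raw)={raw_var[d_k]:8.2f}  var(scaled)={scaled_var[d_k]:.2f}")

# Effect on softmax: the same scores at a larger magnitude become nearly one-hot
scores = torch.tensor([1.0, 2.0, 3.0])
smooth = torch.softmax(scores, dim=-1)
peaky = torch.softmax(scores * 8, dim=-1)   # 8 = sqrt(64), mimicking unscaled scores
print("smooth:", smooth)
print("peaky: ", peaky)
```

- A nearly one-hot softmax output has very small gradients for most entries, which is why keeping the score variance roughly constant is helpful during training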


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/17.webp" width="600px">

- In **step 4**, we now compute the context vector for input query vector 2:

In [22]:
context_vec_2 = attn_weights_2 @ values
print(context_vec_2)

tensor([0.3061, 0.8210])


### 3.4.2 Implementing a compact SelfAttention class

- Putting it all together, we can implement the self-attention mechanism as follows:

In [23]:
import torch.nn as nn

class SelfAttention_v1(nn.Module):

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        
        attn_scores = queries @ keys.T # omega
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )

        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs))

tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        [0.3058, 0.8203],
        [0.2948, 0.7939],
        [0.2927, 0.7891],
        [0.2990, 0.8040]], grad_fn=<MmBackward0>)


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/18.webp" width="400px">

- We can streamline the implementation above using PyTorch's Linear layers, which are equivalent to a matrix multiplication if we disable the bias units
- Another big advantage of using `nn.Linear` over our manual `nn.Parameter(torch.rand(...))` approach is that `nn.Linear` has a preferred weight initialization scheme, which leads to more stable model training

In [24]:
class SelfAttention_v2(nn.Module):

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))

tensor([[-0.0739,  0.0713],
        [-0.0748,  0.0703],
        [-0.0749,  0.0702],
        [-0.0760,  0.0685],
        [-0.0763,  0.0679],
        [-0.0754,  0.0693]], grad_fn=<MmBackward0>)


- Note that `SelfAttention_v1` and `SelfAttention_v2` give different outputs because they use different initial weights for the weight matrices
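
- One way to check this explanation is to copy the weights from a `SelfAttention_v2` instance into a `SelfAttention_v1` instance and compare the outputs. The sketch below repeats both classes so that it runs on its own; the random input `x` is a placeholder. Note that `nn.Linear` stores its weight with shape `(d_out, d_in)`, so the weights are transposed when copied:

```python
import torch
import torch.nn as nn

class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys, queries, values = x @ self.W_key, x @ self.W_query, x @ self.W_value
        attn_weights = torch.softmax(queries @ keys.T / keys.shape[-1]**0.5, dim=-1)
        return attn_weights @ values

class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys, queries, values = self.W_key(x), self.W_query(x), self.W_value(x)
        attn_weights = torch.softmax(queries @ keys.T / keys.shape[-1]**0.5, dim=-1)
        return attn_weights @ values

torch.manual_seed(123)
d_in, d_out = 3, 2
x = torch.rand(6, d_in)  # placeholder input: 6 tokens, 3 dimensions

sa_v1 = SelfAttention_v1(d_in, d_out)
sa_v2 = SelfAttention_v2(d_in, d_out)

# Copy the v2 weights into v1; nn.Linear computes x @ weight.T, hence the transpose
with torch.no_grad():
    sa_v1.W_query.copy_(sa_v2.W_query.weight.T)
    sa_v1.W_key.copy_(sa_v2.W_key.weight.T)
    sa_v1.W_value.copy_(sa_v2.W_value.weight.T)

out_v1 = sa_v1(x)
out_v2 = sa_v2(x)
print(torch.allclose(out_v1, out_v2, atol=1e-6))
```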

## 3.5 Hiding future words with causal attention

- In causal attention, the attention weights above the diagonal are masked, ensuring that for any given input, the LLM is unable to utilize future tokens while calculating the context vectors with the attention weights

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/19.webp" width="400px">

### 3.5.1 Applying a causal attention mask

- In this section, we are converting the previous self-attention mechanism into a causal self-attention mechanism
- Causal self-attention ensures that the model's prediction for a certain position in a sequence is only dependent on the known outputs at previous positions, not on future positions
- In simpler words, this ensures that each next-word prediction only depends on the preceding words
- To achieve this, for each given token, we mask out the future tokens (the ones that come after the current token in the input text):

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/20.webp" width="600px">

- To illustrate and implement causal self-attention, let's work with the attention scores and weights from the previous section:

In [25]:
# Reuse the query and key weight matrices of the
# SelfAttention_v2 object from the previous section for convenience
queries = sa_v2.W_query(inputs)
keys = sa_v2.W_key(inputs) 
attn_scores = queries @ keys.T

attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)

tensor([[0.1921, 0.1646, 0.1652, 0.1550, 0.1721, 0.1510],
        [0.2041, 0.1659, 0.1662, 0.1496, 0.1665, 0.1477],
        [0.2036, 0.1659, 0.1662, 0.1498, 0.1664, 0.1480],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.1661, 0.1564],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.1585],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)


- The simplest way to mask out future attention weights is by creating a mask via PyTorch's `tril` function with elements below the main diagonal (including the diagonal itself) set to 1 and above the main diagonal set to 0:

In [26]:
context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))
print(mask_simple)

tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1.]])


- Then, we can multiply the attention weights with this mask to zero out the attention weights above the diagonal:

In [27]:
masked_simple = attn_weights*mask_simple
print(masked_simple)

tensor([[0.1921, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2041, 0.1659, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2036, 0.1659, 0.1662, 0.0000, 0.0000, 0.0000],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.0000, 0.0000],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<MulBackward0>)


- However, if the mask were applied after softmax, like above, it would disrupt the probability distribution created by softmax
- Softmax ensures that all output values sum to 1
- Masking after softmax would require re-normalizing the outputs to sum to 1 again, which complicates the process and might lead to unintended effects

- To make sure that the rows sum to 1, we can normalize the attention weights as follows:

In [28]:
row_sums = masked_simple.sum(dim=-1, keepdim=True)
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<DivBackward0>)


- While we are technically done with coding the causal attention mechanism now, let's briefly look at a more efficient approach to achieve the same as above
- So, instead of zeroing out attention weights above the diagonal and renormalizing the results, we can mask the unnormalized attention scores above the diagonal with negative infinity before they enter the softmax function:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/21.webp" width="450px">

In [29]:
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)

tensor([[0.2899,   -inf,   -inf,   -inf,   -inf,   -inf],
        [0.4656, 0.1723,   -inf,   -inf,   -inf,   -inf],
        [0.4594, 0.1703, 0.1731,   -inf,   -inf,   -inf],
        [0.2642, 0.1024, 0.1036, 0.0186,   -inf,   -inf],
        [0.2183, 0.0874, 0.0882, 0.0177, 0.0786,   -inf],
        [0.3408, 0.1270, 0.1290, 0.0198, 0.1290, 0.0078]],
       grad_fn=<MaskedFillBackward0>)


- As we can see below, now the attention weights in each row correctly sum to 1 again:

In [30]:
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)
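
- The two masking strategies give the same result because renormalizing a masked softmax row cancels the original softmax denominator. The following sketch checks this numerically; the random `attn_scores` and the key dimension `d_k = 2` are placeholder assumptions and not values from the cells above:

```python
import torch

torch.manual_seed(123)
context_length = 6
d_k = 2  # assumed key dimension, used only for the scaling factor
attn_scores = torch.randn(context_length, context_length)  # placeholder scores

# Approach A: softmax first, then zero out future positions and renormalize
weights = torch.softmax(attn_scores / d_k**0.5, dim=-1)
tril_mask = torch.tril(torch.ones(context_length, context_length))
masked_a = weights * tril_mask
masked_a = masked_a / masked_a.sum(dim=-1, keepdim=True)

# Approach B: fill future positions with -inf, then apply softmax once
triu_mask = torch.triu(torch.ones(context_length, context_length), diagonal=1).bool()
masked_b = torch.softmax(
    attn_scores.masked_fill(triu_mask, -torch.inf) / d_k**0.5, dim=-1
)

print(torch.allclose(masked_a, masked_b, atol=1e-6))
print(masked_b.sum(dim=-1))
```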


### 3.5.2 Masking additional attention weights with dropout

- In addition, we also apply dropout to reduce overfitting during training
- Dropout can be applied in several places:
  - for example, after computing the attention weights;
  - or after multiplying the attention weights with the value vectors
- Here, we will apply the dropout mask after computing the attention weights because it's more common

- Furthermore, in this specific example, we use a dropout rate of 50%, which means randomly masking out half of the attention weights. (When we train the GPT model later, we will use a lower dropout rate, such as 0.1 or 0.2.)

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/22.webp" width="400px">

- If we apply a dropout rate of 0.5 (50%), the non-dropped values will be scaled accordingly by a factor of 1/0.5 = 2
- The scaling is calculated by the formula 1 / (1 - `dropout_rate`)

In [31]:
torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5) # dropout rate of 50%
example = torch.ones(6, 6) # create a matrix of ones

print(dropout(example))

tensor([[2., 2., 0., 2., 2., 0.],
        [0., 0., 0., 2., 0., 2.],
        [2., 2., 2., 2., 0., 2.],
        [0., 2., 2., 0., 0., 2.],
        [0., 2., 0., 2., 0., 2.],
        [0., 2., 2., 2., 2., 0.]])


In [32]:
torch.manual_seed(123)
print(dropout(attn_weights))

tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.7599, 0.6194, 0.6206, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.4921, 0.4925, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.3966, 0.0000, 0.3775, 0.0000, 0.0000],
        [0.0000, 0.3327, 0.3331, 0.3084, 0.3331, 0.0000]],
       grad_fn=<MulBackward0>)


- Note that the resulting dropout outputs may look different depending on your operating system; you can read more about this inconsistency [here on the PyTorch issue tracker](https://github.com/pytorch/pytorch/issues/121595)
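
- The purpose of the 1 / (1 - `dropout_rate`) scaling is to keep the expected value of each entry unchanged. A small sketch on a large matrix of ones makes this visible; the matrix size and seed are arbitrary choices for illustration, and the exact dropped positions may differ between systems:

```python
import torch

torch.manual_seed(123)
dropout_rate = 0.5
dropout = torch.nn.Dropout(dropout_rate)  # modules start in training mode

ones = torch.ones(1000, 1000)
dropped = dropout(ones)

scale = 1 / (1 - dropout_rate)                         # 2.0 for a 50% rate
kept_fraction = (dropped != 0).float().mean().item()   # close to 1 - dropout_rate
mean_after = dropped.mean().item()                     # close to the original mean of 1

print("Scale factor:", scale)
print("Distinct values:", dropped.unique())
print("Kept fraction:", kept_fraction)
print("Mean after dropout:", mean_after)
```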

### 3.5.3 Implementing a compact causal self-attention class

- Now, we are ready to write a working implementation of self-attention, including the causal and dropout masks
- One more thing is to implement the code to handle batches consisting of more than one input so that our `CausalAttention` class supports the batch outputs produced by the data loader we implemented in chapter 2
- For simplicity, to simulate such batch input, we duplicate the input text example:

In [33]:
batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape) # 2 inputs with 6 tokens each, and each token has embedding dimension 3

torch.Size([2, 6, 3])


In [34]:
class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout) # New
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) # New

    def forward(self, x):
        b, num_tokens, d_in = x.shape # New batch dimension b
        # For inputs where `num_tokens` exceeds `context_length`, this will result in errors
        # in the mask creation further below.
        # In practice, this is not a problem since the LLM (chapters 4-7) ensures that inputs
        # do not exceed `context_length` before reaching this forward method.
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2) # Changed transpose
        attn_scores.masked_fill_(  # New, _ ops are in-place
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)  # `:num_tokens` to account for cases where the number of tokens in the batch is smaller than the supported context_size
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights) # New

        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(123)

context_length = batch.shape[1]
ca = CausalAttention(d_in, d_out, context_length, 0.0)

context_vecs = ca(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

tensor([[[-0.4519,  0.2216],
         [-0.5874,  0.0058],
         [-0.6300, -0.0632],
         [-0.5675, -0.0843],
         [-0.5526, -0.0981],
         [-0.5299, -0.1081]],

        [[-0.4519,  0.2216],
         [-0.5874,  0.0058],
         [-0.6300, -0.0632],
         [-0.5675, -0.0843],
         [-0.5526, -0.0981],
         [-0.5299, -0.1081]]], grad_fn=<UnsafeViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])
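
- A quick way to convince ourselves that the mask works is to change only the last token and confirm that the context vectors of all earlier positions stay the same. The sketch below repeats a condensed `CausalAttention` so that it runs on its own; the random input `x` and the size of the perturbation are placeholder assumptions. It also passes a shorter sequence to exercise the `:num_tokens` slicing of the mask:

```python
import torch
import torch.nn as nn

class CausalAttention(nn.Module):
    # Condensed copy of the class above so that this snippet is self-contained
    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys, queries, values = self.W_key(x), self.W_query(x), self.W_value(x)
        attn_scores = queries @ keys.transpose(1, 2)
        attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        return self.dropout(attn_weights) @ values

torch.manual_seed(123)
d_in, d_out, context_length = 3, 2, 6
ca = CausalAttention(d_in, d_out, context_length, dropout=0.0)

x = torch.rand(1, context_length, d_in)   # placeholder batch with one sequence
x_changed = x.clone()
x_changed[0, -1] += 1.0                   # perturb only the last ("future") token

with torch.no_grad():
    out = ca(x)
    out_changed = ca(x_changed)
    out_short = ca(x[:, :4])              # fewer tokens than context_length

# Earlier positions never attend to the last token, so they are unaffected
print(torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-6))
print(out_short.shape)
```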


- Note that dropout is only applied during training, not during inference
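
- In PyTorch, this behavior is controlled by the module's mode: calling `.train()` or `.eval()` on a module also switches any `nn.Dropout` layers inside it. A minimal sketch with placeholder attention weights (the uniform 0.25 values are an assumption chosen for readability):

```python
import torch

torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)
weights = torch.full((4, 4), 0.25)  # placeholder attention weights

dropout.train()                 # training mode: entries are dropped and the rest rescaled
train_out = dropout(weights)    # each entry is either 0 or 0.25 * 2 = 0.5

dropout.eval()                  # inference mode: dropout acts as the identity
eval_out = dropout(weights)

print(train_out)
print(eval_out)
```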

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/23.webp" width="500px">

## 3.6 Extending single-head attention to multi-head attention

### 3.6.1 Stacking multiple single-head attention layers

- Below is a summary of the self-attention implemented previously (causal and dropout masks not shown for simplicity)

- This is also called single-head attention:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/24.webp" width="400px">

- We simply stack multiple single-head attention modules to obtain a multi-head attention module:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/25.webp" width="400px">

- The main idea behind multi-head attention is to run the attention mechanism multiple times (in parallel) with different, learned linear projections. This allows the model to jointly attend to information from different representation subspaces at different positions.

In [35]:
class MultiHeadAttentionWrapper(nn.Module):

    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias) 
             for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)


torch.manual_seed(123)

context_length = batch.shape[1] # This is the number of tokens
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(
    d_in, d_out, context_length, 0.0, num_heads=2
)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

tensor([[[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]],

        [[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]]], grad_fn=<CatBackward0>)
context_vecs.shape: torch.Size([2, 6, 4])


- In the implementation above, the embedding dimension is 4, because we use `d_out=2` as the embedding dimension for the key, query, and value vectors as well as the context vector. Since we have 2 attention heads, the output embedding dimension is 2*2=4
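The dimension arithmetic can be checked in isolation. The sketch below uses random tensors as stand-ins for the per-head context vectors (an assumption for illustration; they are not the outputs of `CausalAttention`) and concatenates them along the last dimension, as the wrapper's `forward` method does.

```python
import torch

torch.manual_seed(123)

num_heads, d_out = 2, 2
batch_size, num_tokens = 2, 6

# Stand-ins for the context vectors produced by each head
head_outputs = [torch.randn(batch_size, num_tokens, d_out) for _ in range(num_heads)]

# Concatenating along the last dimension gives num_heads * d_out features per token
combined = torch.cat(head_outputs, dim=-1)

print("per-head shape:", head_outputs[0].shape)
print("combined shape:", combined.shape)
```

The first `d_out` columns of `combined` come from the first head, the next `d_out` columns from the second head, and so on.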

### 3.6.2 Implementing multi-head attention with weight splits

- While the above is an intuitive and fully functional implementation of multi-head attention (wrapping the single-head attention `CausalAttention` implementation from earlier), we can write a stand-alone class called `MultiHeadAttention` to achieve the same result

- We don't concatenate single attention heads for the stand-alone `MultiHeadAttention` class
- Instead, we create single W_query, W_key, and W_value weight matrices and then split those into individual matrices for each attention head:

In [36]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        # As in `CausalAttention`, for inputs where `num_tokens` exceeds `context_length`,
        # this will result in errors in the mask creation further below.
        # In practice, this is not a problem since the LLM (chapters 4-7) ensures that inputs
        # do not exceed `context_length` before reaching this forward method.

        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) 
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)
        
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2) 
        
        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec) # optional projection

        return context_vec

torch.manual_seed(123)

batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

tensor([[[0.3190, 0.4858],
         [0.2943, 0.3897],
         [0.2856, 0.3593],
         [0.2693, 0.3873],
         [0.2639, 0.3928],
         [0.2575, 0.4028]],

        [[0.3190, 0.4858],
         [0.2943, 0.3897],
         [0.2856, 0.3593],
         [0.2693, 0.3873],
         [0.2639, 0.3928],
         [0.2575, 0.4028]]], grad_fn=<ViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])
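The mask truncation and masking steps in the `forward` method can be examined on their own. The following sketch assumes a `context_length` of 6 and an input of only 4 tokens (illustrative values), with random tensors standing in for the projected queries and keys.

```python
import torch

torch.manual_seed(123)

context_length, num_tokens = 6, 4      # illustrative sizes
b, num_heads, head_dim = 1, 2, 3       # illustrative sizes

# Full-size causal mask, as registered in the module's buffer
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)

# Truncate to the actual sequence length and convert to boolean
mask_bool = mask.bool()[:num_tokens, :num_tokens]

# Random stand-ins for the per-head queries and keys
queries = torch.randn(b, num_heads, num_tokens, head_dim)
keys = torch.randn(b, num_heads, num_tokens, head_dim)

attn_scores = queries @ keys.transpose(2, 3)  # (b, num_heads, num_tokens, num_tokens)

# The 2D mask broadcasts over the batch and head dimensions
attn_scores = attn_scores.masked_fill(mask_bool, -torch.inf)
attn_weights = torch.softmax(attn_scores / head_dim**0.5, dim=-1)

print(attn_weights[0, 0])
```

Each row of the printed matrix sums to 1, and all entries above the diagonal are 0, so no token attends to a later position.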


- Note that the above is essentially a rewritten version of `MultiHeadAttentionWrapper` that is more efficient
- The resulting output looks a bit different since the random weight initializations differ, but both are fully functional implementations that can be used in the GPT class we will implement in the upcoming chapters
- Note that in addition, we added a linear projection layer (`self.out_proj`) to the `MultiHeadAttention` class above. This is simply a linear transformation that doesn't change the dimensions. It's a standard convention to use such a projection layer in LLM implementations, but it's not strictly necessary (recent research has shown that it can be removed without affecting the modeling performance; see the further reading section at the end of this chapter)
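To see why a single large projection followed by a reshape is equivalent to separate per-head projections, the sketch below compares the two on a small random input. The sizes and the variable names `per_head` and `x` are chosen here for illustration; the check relies on `nn.Linear` storing its weight with shape `(d_out, d_in)`, so that consecutive blocks of `head_dim` rows belong to consecutive heads.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

b, num_tokens, d_in, d_out, num_heads = 1, 3, 4, 6, 2  # illustrative sizes
head_dim = d_out // num_heads

W_query = nn.Linear(d_in, d_out, bias=False)
x = torch.randn(b, num_tokens, d_in)

# One large projection, then split into heads via view + transpose
queries = W_query(x).view(b, num_tokens, num_heads, head_dim).transpose(1, 2)

# Equivalent computation with one weight slice per head
per_head = torch.stack(
    [x @ W_query.weight[h * head_dim:(h + 1) * head_dim].T for h in range(num_heads)],
    dim=1,
)  # (b, num_heads, num_tokens, head_dim)

print(queries.shape)
print(torch.allclose(queries, per_head, atol=1e-6))
```

The single-matrix version performs one matrix multiplication instead of one per head, which is where the efficiency gain over the wrapper comes from.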


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/26.webp" width="400px">

- Note that if you are interested in a compact and efficient implementation of the above, you can also consider the [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) class in PyTorch
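A minimal usage sketch of `torch.nn.MultiheadAttention` for causal self-attention might look as follows. The sizes are illustrative assumptions. Unlike the `MultiHeadAttention` class above, this module uses a single `embed_dim` for both input and output by default, and a boolean `attn_mask` marks with `True` the positions that may not be attended to.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

batch_size, num_tokens, embed_dim, num_heads = 2, 6, 4, 2  # illustrative sizes
x = torch.rand(batch_size, num_tokens, embed_dim)

mha = nn.MultiheadAttention(
    embed_dim, num_heads, dropout=0.0, bias=False, batch_first=True
)

# True above the diagonal: future tokens are masked out
causal_mask = torch.triu(
    torch.ones(num_tokens, num_tokens, dtype=torch.bool), diagonal=1
)

# Self-attention: query, key, and value are all the same input
context_vecs, attn_weights = mha(x, x, x, attn_mask=causal_mask, need_weights=True)

print("context_vecs.shape:", context_vecs.shape)
print("attn_weights.shape:", attn_weights.shape)  # averaged over heads by default
```

Setting `batch_first=True` makes the module accept inputs of shape `(batch, num_tokens, embed_dim)`, matching the layout used throughout this chapter.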

- Since the `MultiHeadAttention` implementation above may look a bit complex at first glance, let's look at what happens when executing `attn_scores = queries @ keys.transpose(2, 3)`:

In [37]:
# (b, num_heads, num_tokens, head_dim) = (1, 2, 3, 4)
a = torch.tensor([[[[0.2745, 0.6584, 0.2775, 0.8573],
                    [0.8993, 0.0390, 0.9268, 0.7388],
                    [0.7179, 0.7058, 0.9156, 0.4340]],

                   [[0.0772, 0.3565, 0.1479, 0.5331],
                    [0.4066, 0.2318, 0.4545, 0.9737],
                    [0.4606, 0.5159, 0.4220, 0.5786]]]])

print(a @ a.transpose(2, 3))

tensor([[[[1.3208, 1.1631, 1.2879],
          [1.1631, 2.2150, 1.8424],
          [1.2879, 1.8424, 2.0402]],

         [[0.4391, 0.7003, 0.5903],
          [0.7003, 1.3737, 1.0620],
          [0.5903, 1.0620, 0.9912]]]])


- In this case, the matrix multiplication implementation in PyTorch will handle the 4-dimensional input tensor so that the matrix multiplication is carried out between the 2 last dimensions (num_tokens, head_dim) and then repeated for the individual heads

- For instance, the batched multiplication above is a more compact way to compute the following matrix multiplications for each head separately:

In [38]:
first_head = a[0, 0, :, :]
first_res = first_head @ first_head.T
print("First head:\n", first_res)

second_head = a[0, 1, :, :]
second_res = second_head @ second_head.T
print("\nSecond head:\n", second_res)

First head:
 tensor([[1.3208, 1.1631, 1.2879],
        [1.1631, 2.2150, 1.8424],
        [1.2879, 1.8424, 2.0402]])

Second head:
 tensor([[0.4391, 0.7003, 0.5903],
        [0.7003, 1.3737, 1.0620],
        [0.5903, 1.0620, 0.9912]])
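The same equivalence holds for any batch size and number of heads. As a sketch with illustrative random tensors, the batched product can be compared against an explicit loop and against `torch.einsum`, whose index string spells out that only the `head_dim` axis is summed over.

```python
import torch

torch.manual_seed(123)

b, num_heads, num_tokens, head_dim = 2, 3, 4, 5  # illustrative sizes
queries = torch.randn(b, num_heads, num_tokens, head_dim)
keys = torch.randn(b, num_heads, num_tokens, head_dim)

# Batched matrix multiplication over the last two dimensions
batched = queries @ keys.transpose(2, 3)

# Same contraction written with explicit indices (d = head_dim is summed out)
via_einsum = torch.einsum("bhqd,bhkd->bhqk", queries, keys)

# Explicit loop over batch entries and heads
looped = torch.empty(b, num_heads, num_tokens, num_tokens)
for i in range(b):
    for h in range(num_heads):
        looped[i, h] = queries[i, h] @ keys[i, h].T

print(batched.shape)
print(torch.allclose(batched, via_einsum, atol=1e-5))
print(torch.allclose(batched, looped, atol=1e-5))
```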


# Summary and takeaways

- See the [./multihead-attention.ipynb](./multihead-attention.ipynb) code notebook, which is a concise version of the data loader (chapter 2) plus the multi-head attention class that we implemented in this chapter and will need for training the GPT model in upcoming chapters
- You can find the exercise solutions in [./exercise-solutions.ipynb](./exercise-solutions.ipynb)