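> 1. The core of modern language models, and of modern AI in general, is the Transformer architecture, and the most distinctive low-level computation in the Transformer is Attention;
> 2. Any amount of time spent exploring the Transformer architecture and the Attention computation is worth it.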

In [39]:
import os
os.environ['http_proxy'] = 'http://127.0.0.1:7890'
os.environ['https_proxy'] = 'http://127.0.0.1:7890'

import torch
from torch import nn
import torch.nn.functional as F
torch.manual_seed(42)

<torch._C.Generator at 0x72694a27ac50>

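## Review of GPT


- Reviewing the GPT forward pass
    - input_ids: 1*1024, one sequence (batch size 1) of 1024 token ids
    - last_hidden_states: 1\*1024\*768
        - hidden states of the last transformer layer
    - lm_logits: 1\*1024\*50257
        - the LM head maps the hidden state at each position to logits over the whole vocabulary; a softmax turns them into a probability distribution
        
- Shifted labels and the loss computation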

    ```
    labels = labels.to(lm_logits.device)
    
    # Shift so that tokens < n predict n
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    ```

In [4]:
logits = [1, 2, 3, 4, 5]
labels = [1, 2, 3, 4, 5]
print(logits, logits[:-1])
print(labels, labels[1:])

[1, 2, 3, 4, 5] [1, 2, 3, 4]
[1, 2, 3, 4, 5] [2, 3, 4, 5]
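
The list slicing above can be repeated on real tensors. The following is a minimal sketch, assuming toy sizes (5 positions, a vocabulary of 7) that are not taken from GPT-2. It applies the same shift as the snippet from `modeling_gpt2.py` and checks the result against a hand-written loop that picks the log-probability of each next token.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes (assumptions for illustration): batch 1, 5 positions, vocabulary of 7.
batch_size, seq_len, vocab_size = 1, 5, 7
lm_logits = torch.randn(batch_size, seq_len, vocab_size)
labels = torch.randint(0, vocab_size, (batch_size, seq_len))

# Position t predicts token t+1: drop the last logit and the first label.
shift_logits = lm_logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()

loss = F.cross_entropy(
    shift_logits.view(-1, shift_logits.size(-1)),
    shift_labels.view(-1),
)

# Same quantity by hand: pick the log-probability of each next token and average.
log_probs = F.log_softmax(lm_logits, dim=-1)
manual = -torch.stack(
    [log_probs[0, t, labels[0, t + 1]] for t in range(seq_len - 1)]
).mean()

print(shift_logits.shape, shift_labels.shape)
print(loss.item(), manual.item())
```

With 5 input positions only 4 predictions are scored, because the last position has no next token to predict.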


### Implementing causal (decoder-only) unidirectional attention

- BERT: bidirectional self-attention

    $$
    \quad \text{Attention}(Q^{(n \times d_k)}, K^{(n \times d_k)}, V^{(n \times d_v)}) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V 
    $$

- GPT: unidirectional causal self-attention

    $$
    \quad \text{Attention}(Q^{(n \times d_k)}, K^{(n \times d_k)}, V^{(n \times d_v)}) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}+ M\right)V
    $$

    - $M_{ij}=0, j\le i$
    - $M_{ij}=-\infty, j> i$
    
    $$
    M = \begin{pmatrix}
    0 & -\infty & -\infty & \cdots & -\infty \\
    0 & 0 & -\infty & \cdots & -\infty \\
    0 & 0 & 0 & \cdots & -\infty \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & 0 & \cdots & 0
    \end{pmatrix}_{n\times n}
    $$

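- T5: the encoder output supplies K and V (both computed from the same encoder hidden states), the decoder supplies Q, and the two are combined by cross-attention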

    $$
    \begin{split}
    \text{Encoder Self-Attention} &: \quad \text{Attention}(Q^{(n \times d_k)}, K^{(n \times d_k)}, V^{(n \times d_v)}) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\\
    \text{Decoder Masked Self-Attention} & : \quad \text{Attention}(Q^{(m \times d_k)}, K^{(m \times d_k)}, V^{(m \times d_v)}) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}+M\right)V \\
    \text{Cross-Attention} & : \quad \text{Attention}(Q^{(m \times d_k)}, K^{(n \times d_k)}, V^{(n \times d_v)}) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \\
    \end{split}
    $$


- modeling_gpt2.py
    - GPT2Attention._attn

```
if not self.is_cross_attention:
    # if only "normal" attention layer implements causal mask
    query_length, key_length = query.size(-2), key.size(-2)
    causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length]
    mask_value = torch.finfo(attn_weights.dtype).min
    # Need to be a tensor, otherwise we get error: `RuntimeError: expected scalar type float but found double`.
    # Need to be on the same device, otherwise `RuntimeError: ..., x and y to be on the same device`
    mask_value = torch.full([], mask_value, dtype=attn_weights.dtype, device=attn_weights.device)
    attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)

if attention_mask is not None:
    # Apply the attention mask
    attn_weights = attn_weights + attention_mask
```
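
A minimal sketch of the two masking styles on a toy single-head example, assuming `n = 4` and `d_k = 8` (illustrative values, not GPT-2's). Variant 1 adds a mask $M$ with $0$ on allowed entries and $-\infty$ on future ones, as in the formula for causal attention. Variant 2 follows the `torch.where` pattern of `GPT2Attention._attn` and overwrites future scores with the smallest finite value of the dtype. After the softmax both should give the same weights.

```python
import math
import torch

torch.manual_seed(0)

n, d_k = 4, 8  # toy sequence length and head dimension (assumed values)
q = torch.randn(n, d_k)
k = torch.randn(n, d_k)
v = torch.randn(n, d_k)

scores = q @ k.T / math.sqrt(d_k)

# Boolean lower-triangular matrix: True where position i may attend to position j (j <= i).
allowed = torch.tril(torch.ones(n, n, dtype=torch.bool))

# Variant 1: additive mask M with 0 on allowed entries and -inf elsewhere.
M = torch.zeros(n, n).masked_fill(~allowed, float("-inf"))
weights_additive = torch.softmax(scores + M, dim=-1)

# Variant 2: GPT-2 style, overwrite disallowed scores with the dtype's minimum value.
mask_value = torch.full([], torch.finfo(scores.dtype).min, dtype=scores.dtype)
weights_where = torch.softmax(torch.where(allowed, scores, mask_value), dim=-1)

out = weights_additive @ v  # (n, d_k): each row mixes only current and past values
print(weights_additive)
```

Every entry above the diagonal of `weights_additive` is zero, so position `i` never receives information from positions `j > i`.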

### CrossEntropyLoss

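From a computational point of view:
- labels act as a selector: each label picks the log-probability of its own class
- ignore_index: filters out positions (default -100)
    - used when computing PPL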

In [34]:
# labels (token ids) act as a selector; -100 marks positions to ignore

# Example of target with class indices
loss = nn.CrossEntropyLoss()

In [35]:
# token-wise logits (transformer output)
input = torch.randn(3, 5, requires_grad=True)
input

tensor([[ 0.3367,  0.1288,  0.2345,  0.2303, -1.1229],
        [-0.1863,  2.2082, -0.6380,  0.4617,  0.2674],
        [ 0.5349,  0.8094,  1.1103, -1.6898, -0.9890]], requires_grad=True)

In [36]:
target = torch.empty(3, dtype=torch.long).random_(5)
target

tensor([0, 4, 3])

In [38]:
# logits selected by target: 0.3367, 0.2674, -1.6898
output = loss(input, target)
output

tensor(2.4607, grad_fn=<NllLossBackward0>)

In [40]:
# log-probs selected by target: -1.3472, -2.3242, -3.7108
F.log_softmax(input, dim=-1)

tensor([[-1.3472, -1.5551, -1.4494, -1.4535, -2.8067],
        [-2.7779, -0.3834, -3.2296, -2.1299, -2.3242],
        [-1.4860, -1.2116, -0.9107, -3.7108, -3.0099]],
       grad_fn=<LogSoftmaxBackward0>)

In [41]:
# mean of the selected log-probs; the loss is its negative
(-1.3472 + (-2.3242) + (-3.7108))/3

-2.460733333333333

In [43]:
target[-1] = -100
target

tensor([   0,    4, -100])

In [44]:
loss(input, target)

tensor(1.8357, grad_fn=<NllLossBackward0>)

In [45]:
# the row labelled -100 is dropped from both the sum and the count
(-1.3472 + (-2.3242))/2

-1.8356999999999999
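
The cells above can be condensed into one function. This is a minimal sketch, assuming the default mean reduction; `manual_cross_entropy` is a hypothetical helper name, not a PyTorch API. It reuses the logits printed earlier (rounded to four decimals) and compares the result with `F.cross_entropy`.

```python
import torch
import torch.nn.functional as F

def manual_cross_entropy(logits, target, ignore_index=-100):
    """Mean negative log-probability of the target classes, skipping ignored rows."""
    log_probs = F.log_softmax(logits, dim=-1)
    keep = target != ignore_index
    # Replace ignored labels with a valid index (0) so gather does not fail;
    # those rows are dropped by `keep` afterwards.
    safe_target = target.masked_fill(~keep, 0)
    picked = log_probs.gather(dim=-1, index=safe_target.unsqueeze(-1)).squeeze(-1)
    return -picked[keep].mean()

# Logits copied from the notebook output above (rounded to 4 decimals).
logits = torch.tensor([
    [ 0.3367,  0.1288,  0.2345,  0.2303, -1.1229],
    [-0.1863,  2.2082, -0.6380,  0.4617,  0.2674],
    [ 0.5349,  0.8094,  1.1103, -1.6898, -0.9890],
])

full_target = torch.tensor([0, 4, 3])
masked_target = torch.tensor([0, 4, -100])

print(manual_cross_entropy(logits, full_target), F.cross_entropy(logits, full_target))
print(manual_cross_entropy(logits, masked_target), F.cross_entropy(logits, masked_target))
```

The ignored row changes the denominator as well as the numerator: the mean is taken over the two remaining rows, not over three.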

## Training & Inference/Generate

- llama2/3 inference code: autoregressive, token by token generation
    - https://github.com/meta-llama/llama3/blob/main/llama/generation.py#L179-L192C13
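- During training, the causal mask (a lower-triangular matrix) makes one parallel forward pass equivalent to autoregressive, token-by-token prediction
- Computing PPL (a metric of how well a language model has been trained) works the same way: given an existing test set of text, self-attention with a causal mask reproduces autoregressive, token-by-token prediction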

In [3]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Initialize the model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2').to('cuda')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Input sequence
input_text = "The quick brown fox jumps over the lazy dog"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

In [15]:
input_ids.shape

torch.Size([1, 9])

In [23]:
outputs = model(input_ids.to('cuda'), )
logits = outputs.logits
logits.shape, logits[:, 1:-1, :]

(torch.Size([1, 9, 50257]),
 tensor([[[-62.3139, -61.5645, -66.4938,  ..., -68.1286, -68.3228, -63.5829],
          [-66.3240, -66.7452, -72.1618,  ..., -75.1955, -73.4650, -68.1786],
          [-88.2910, -88.7236, -93.4422,  ..., -98.6211, -90.6379, -90.9913],
          ...,
          [-80.7563, -82.8596, -87.4034,  ..., -91.0716, -89.5648, -84.5701],
          [-94.8247, -94.5054, -97.7886,  ..., -97.1508, -98.4995, -96.5095],
          [-88.8787, -87.6110, -92.3262,  ..., -95.8310, -93.5163, -91.9581]]],
        device='cuda:0', grad_fn=<SliceBackward0>))

In [18]:
# Feed the sequence prefix by prefix and record the logits of each step
generated_logits = []

# Start with the first token and grow the prefix one token at a time
for i in range(1, input_ids.size(1)):
    step_input_ids = input_ids[:, :i]  # input prefix for the current step
    outputs = model(step_input_ids.to('cuda'))
    logits = outputs.logits
    next_token_logits = logits[:, -1, :]  # logits at the last position of the prefix
    generated_logits.append(next_token_logits)

generated_logits = torch.stack(generated_logits, dim=1)

In [24]:
generated_logits.shape, generated_logits[:, 1:, :]

(torch.Size([1, 8, 50257]),
 tensor([[[-62.3139, -61.5645, -66.4938,  ..., -68.1286, -68.3228, -63.5829],
          [-66.3240, -66.7452, -72.1618,  ..., -75.1955, -73.4651, -68.1786],
          [-88.2909, -88.7236, -93.4422,  ..., -98.6211, -90.6378, -90.9913],
          ...,
          [-80.7563, -82.8596, -87.4034,  ..., -91.0716, -89.5648, -84.5701],
          [-94.8247, -94.5054, -97.7886,  ..., -97.1508, -98.4995, -96.5095],
          [-88.8787, -87.6110, -92.3262,  ..., -95.8310, -93.5163, -91.9581]]],
        device='cuda:0', grad_fn=<SliceBackward0>))
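
The same comparison can be run without downloading GPT-2 or using a GPU. The sketch below assumes a hypothetical one-layer, single-head `ToyCausalLM` with random weights (vocabulary of 11, width 16); it is not the GPT-2 architecture. It checks that one parallel pass with a causal mask matches the logits collected prefix by prefix.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)

class ToyCausalLM(nn.Module):
    """Hypothetical one-layer, single-head causal LM used only for this check."""

    def __init__(self, vocab_size=11, d_model=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        x = self.embed(input_ids)                          # (batch, n, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        n = input_ids.size(1)
        future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))  # hide future positions
        h = x + torch.softmax(scores, dim=-1) @ v
        return self.lm_head(h)                             # (batch, n, vocab_size)

toy = ToyCausalLM().eval()
ids = torch.randint(0, 11, (1, 9))

with torch.no_grad():
    full_logits = toy(ids)  # one parallel pass over all 9 tokens
    # Token by token: feed each prefix and keep only the last position's logits.
    step_logits = torch.stack(
        [toy(ids[:, :i])[:, -1, :] for i in range(1, ids.size(1) + 1)], dim=1
    )

print(full_logits.shape, step_logits.shape)
print(torch.allclose(full_logits, step_logits, atol=1e-5))
```

The tolerance allows for small floating-point differences, like the last-digit mismatches visible between the two GPT-2 outputs above (for example `-73.4650` versus `-73.4651`).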

## Computing the PPL metric

In [46]:
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from datasets import load_dataset
from tqdm import tqdm

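stride < max_length (the window length): consecutive windows overlap, and setting the labels inside the overlap to -100 avoids scoring the same tokens twice.

- [0, 1024)
    - trg_len (tokens that contribute to the CrossEntropy loss): 1024
- [512, 1024+512): the window length is 1024
    - trg_len (tokens that contribute to the CrossEntropy loss): 512
- [1024, 1024+1024)
    - trg_len (tokens that contribute to the CrossEntropy loss): 512
- [1024+512, 1024+1024+512)
    - trg_len (tokens that contribute to the CrossEntropy loss): 512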
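
The window arithmetic can be checked without a model. This is a minimal sketch: `sliding_windows` is a hypothetical helper that mirrors the index bookkeeping of the evaluation loop, and the total length of 2560 tokens is an assumed value chosen to match the intervals listed above.

```python
def sliding_windows(seq_len, max_length=1024, stride=512):
    """Yield (begin_loc, end_loc, trg_len) for each evaluation window."""
    prev_end_loc = 0
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc  # tokens not covered by an earlier window
        yield begin_loc, end_loc, trg_len
        prev_end_loc = end_loc
        if end_loc == seq_len:
            break

windows = list(sliding_windows(seq_len=2560))
for begin_loc, end_loc, trg_len in windows:
    print(f"[{begin_loc}, {end_loc}) keeps labels for the last {trg_len} tokens")
```

The `trg_len` values sum to the total sequence length, so each token keeps its label in exactly one window while still seeing up to `max_length - trg_len` tokens of earlier context.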

In [47]:
test_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

model_id = "openai-community/gpt2"
model = GPT2LMHeadModel.from_pretrained(model_id).to('cuda')
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
encodings = tokenizer("\n\n".join(test_dataset["text"]), return_tensors="pt")


max_length = model.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0

for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to('cuda')
    # print('input_ids', input_ids)

    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    # print(begin_loc, end_loc, trg_len, prev_end_loc)

    # assert torch.allclose(input_ids, target_ids), (input_ids.shape, target_ids.shape)

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
        # to the left by 1.
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

print(torch.exp(torch.stack(nlls).mean()))

Token indices sequence length is longer than the specified maximum sequence length for this model (287644 > 1024). Running this sequence through the model will result in indexing errors
100%|█████████▉| 560/562 [00:07<00:00, 73.00it/s]

tensor(25.1880, device='cuda:0')
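
For intuition about the scale of this number, perplexity is the exponential of the mean next-token negative log-likelihood, $\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i}\log p(x_i \mid x_{<i})\right)$. The sketch below assumes a toy vocabulary of 5 and hand-made logits; `perplexity` is a hypothetical helper, not a `transformers` function. A model with no preference scores a PPL equal to the vocabulary size, and a model that is nearly certain of the right token approaches 1.

```python
import torch
import torch.nn.functional as F

def perplexity(logits, labels):
    """exp of the mean next-token negative log-likelihood."""
    shift_logits = logits[..., :-1, :]
    shift_labels = labels[..., 1:]
    nll = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
    return torch.exp(nll)

vocab_size = 5
labels = torch.tensor([[0, 3, 1, 4, 2, 2]])

# All logits equal: the model is as uncertain as a uniform guess over the vocabulary.
uniform_logits = torch.zeros(1, labels.size(1), vocab_size)
ppl_uniform = perplexity(uniform_logits, labels)

# Nearly all probability mass on the correct next token at every position.
confident_logits = torch.zeros(1, labels.size(1), vocab_size)
confident_logits[0, torch.arange(5), labels[0, 1:]] = 20.0
ppl_confident = perplexity(confident_logits, labels)

print(ppl_uniform.item(), ppl_confident.item())
```

On this scale, a PPL of about 25 for GPT-2 on WikiText-2 means the model is, on average, about as uncertain as a uniform choice among 25 tokens out of a vocabulary of 50257.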



