- https://huggingface.co/docs/transformers/llm_tutorial#wrong-padding-side
- decoder-only => left padding
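    - Batched inputs need padding because sequences of different lengths must be packed into one rectangular tensor.
    - padding: `right` appends pad tokens on the right, so the real tokens are left-aligned; `left` prepends pad tokens on the left, so the real tokens are right-aligned.
    - Setting left padding is not enough: `model.generate` must also receive the `attention_mask` (not just `input_ids`).
    - With left padding the real tokens are right-aligned, so the prompt ends at the last position and the model continues directly from the final prompt token.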
    - This is because the output is a continuation of the input prompt -- there would be gaps in the output without left padding.
- If `tokenizer.padding_side` is set to `right` when a decoder-only model runs `generate`, the Transformers code emits a warning:

    ```
    A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
    ```
- Position encoding (position ids) and the attention mask
    - Absolute vs. relative position encoding:
        - GPT-2 uses **absolute position** encoding (https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py#L1246C1-L1247C62)
        ```
        position_ids = attention_mask.long().cumsum(-1) - 1
        position_ids.masked_fill_(attention_mask == 0, 1)
        ```
    - attention_mask
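        - For GPT-2-like decoder-only language models, generation with right padding (pads on the right, real tokens left-aligned) is almost always wrong. The core reason is that an autoregressive model generates the next token from the hidden state at the current last position, so with input + pads => output, the first generated token is computed from the hidden state of the last pad. The attention mask has little to do with this: it does not truncate the sequence, it only adds -inf to the attention scores (before the softmax) so that pad tokens receive no attention weight.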
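
The two points above can be checked on a toy batch without loading a model. This is a minimal sketch: `position_ids_from_mask` is a hypothetical wrapper around the two GPT-2 lines quoted above, and the pad id `0` and token ids `11, 12, 13` are made up for illustration.

```python
import torch

def position_ids_from_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: the two GPT-2 lines quoted above, wrapped in a function."""
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 1)
    return position_ids

pad = 0  # made-up pad id for this toy example
left = torch.tensor([[pad, pad, 11, 12, 13]])   # left padding
right = torch.tensor([[11, 12, 13, pad, pad]])  # right padding
left_mask = (left != pad).long()
right_mask = (right != pad).long()

left_pos = position_ids_from_mask(left_mask)
right_pos = position_ids_from_mask(right_mask)
print(left_pos)   # tensor([[1, 1, 0, 1, 2]])
print(right_pos)  # tensor([[0, 1, 2, 1, 1]])

# The next token is predicted from the last position (index -1).
# With left padding that position holds a real token; with right padding it holds a pad.
last_is_real_left = bool(left_mask[0, -1])    # True
last_is_real_right = bool(right_mask[0, -1])  # False
print(last_is_real_left, last_is_real_right)
```

In both layouts the real tokens get positions 0, 1, 2. What differs is which token sits at the last index, and that is the position the first generated token is read from.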

In [None]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, set_seed

# Set the random seed for reproducibility
seed = 42
set_seed(seed)
torch.manual_seed(seed)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the pretrained GPT-2 model and tokenizer
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name).to(device)

In [None]:
tokenizer.padding_side, tokenizer.pad_token

In [None]:
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.pad_token_id

In [None]:
# Define the input sentence
input_text = "I love you"

In [None]:
# Encode the input sentence
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)
input_ids

## `model.generate`

`# decoder-only models should use left-padding for generation`
- inputs
    - input_ids
    - attention_mask
        - position_ids
- How right padding is detected
    - `torch.sum(input_tensors[:, -1] == generation_config.pad_token_id) > 0`
- wte and wpe are computed as usual, pad positions included
    - `attention_mask = (1.0 - attention_mask) * torch.finfo(self.dtype).min`
        - pad => `torch.finfo(self.dtype).min` (effectively -inf)
        - non-pad => 0
    - When attention is computed: `attn_weights = attn_weights + attention_mask`
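
The detection check and the additive mask can be reproduced in isolation. This is a rough sketch, not the library's code path: the attention scores are all-zero placeholders for a single query, the causal mask is left out, and `float32` stands in for `self.dtype`.

```python
import torch

pad_token_id = 50256  # GPT-2's eos id, reused as the pad id as in the cells above
input_ids = torch.tensor([[40, 1842, 345, 50256, 50256]])  # right-padded "I love you"
attention_mask = torch.tensor([[1.0, 1.0, 1.0, 0.0, 0.0]])

# Right padding is flagged when any sequence ends with the pad token.
right_padding_detected = bool(torch.sum(input_ids[:, -1] == pad_token_id) > 0)
print(right_padding_detected)  # True

# 1 -> 0 (keep), 0 -> most negative representable value (effectively -inf).
dtype = torch.float32  # assumption: stands in for self.dtype
additive_mask = (1.0 - attention_mask) * torch.finfo(dtype).min

# Placeholder scores for one query over the 5 key positions.
attn_weights = torch.zeros(1, 5)
probs = torch.softmax(attn_weights + additive_mask, dim=-1)
print(probs)  # the two pad positions receive probability 0
```

The mask removes the pads as keys, but the sequence keeps its length: index -1 is still a pad position, which is why the mask alone does not repair right padding.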

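In [34]:
padding_length = 2
padding_token_id = tokenizer.pad_token_id
# Build the mask from the unpadded prompt: 0 = pad, 1 = real token
left_attention_mask = torch.cat([torch.zeros((1, padding_length)), torch.ones(input_ids.shape)], dim=1).to(device)
left_attention_mask

tensor([[0., 0., 1., 1., 1.]], device='cuda:0')

In [49]:
# Left-pad the prompt ids (input_ids now holds the padded sequence)
input_ids = torch.cat([torch.full((1, padding_length), padding_token_id).to(device), input_ids], dim=1)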

In [36]:
position_ids = left_attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(left_attention_mask == 0, 1)
position_ids

tensor([[1, 1, 0, 1, 2]], device='cuda:0')

In [50]:
inputs_embeds = model.transformer.wte(input_ids)
position_embeds = model.transformer.wpe(position_ids)

In [39]:
# model.transformer.wpe(torch.tensor([1]).to(device))

In [56]:
hidden_states = inputs_embeds + position_embeds
hidden_states = model.transformer.drop(hidden_states)
# The first two positions along the sequence dimension are pad_token_id
hidden_states.shape

torch.Size([1, 5, 768])

In [59]:
# Encode the input sentence again, then right-pad it
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)
# Build the mask from the unpadded prompt so its length matches the padded ids (3 + 2 = 5)
right_attention_mask = torch.cat([torch.ones(input_ids.shape), torch.zeros((1, padding_length))], dim=1).to(device)
right_padded_input_ids = torch.cat([input_ids, torch.full((1, padding_length), padding_token_id).to(device)], dim=1)
right_attention_mask

tensor([[1., 1., 1., 0., 0.]], device='cuda:0')

In [60]:
position_ids = right_attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(right_attention_mask == 0, 1)
position_ids

tensor([[0, 1, 2, 1, 1]], device='cuda:0')

## coding

### no_padding

In [10]:
outputs_no_padding = model.generate(input_ids, max_length=input_ids.size(1) + 5, do_sample=False)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [12]:
decoded_no_padding = tokenizer.decode(outputs_no_padding[0], skip_special_tokens=True)
decoded_no_padding

'I love you, and I love you'

### left padding no attention mask

In [14]:
padding_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
padding_length = 2
left_padded_input_ids = torch.cat([torch.full((1, padding_length), padding_token_id).to(device), input_ids], dim=1)
left_padded_input_ids

tensor([[50256, 50256,    40,  1842,   345]], device='cuda:0')

In [15]:
outputs_left_padding = model.generate(left_padded_input_ids, 
                                      max_length=left_padded_input_ids.size(1) + 5, do_sample=False)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [17]:
tokenizer.decode(outputs_left_padding[0], skip_special_tokens=True)

'I love you.\n\nI love'

### left padding with attention mask

In [18]:
left_attention_mask = torch.cat([torch.zeros((1, padding_length)), torch.ones(input_ids.shape)], dim=1).to(device)
left_attention_mask

tensor([[0., 0., 1., 1., 1.]], device='cuda:0')

In [19]:
outputs_left_padding = model.generate(left_padded_input_ids, attention_mask=left_attention_mask, 
                                      max_length=left_padded_input_ids.size(1) + 5, do_sample=False)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [20]:
tokenizer.decode(outputs_left_padding[0], skip_special_tokens=True)

'I love you, and I love you'
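
For a real batch, the padded ids and the mask have to be built together. The sketch below does this by hand: `left_pad_batch` is a hypothetical helper, not a Transformers API, and in practice a tokenizer with `padding_side='left'` and `padding=True` is expected to produce the same layout.

```python
import torch

def left_pad_batch(sequences, pad_token_id):
    """Hypothetical helper: left-pad token-id lists to a common length.

    Returns (input_ids, attention_mask), both of shape (batch, max_len).
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_token_id] * n_pad + list(seq))  # pads first
        attention_mask.append([0] * n_pad + [1] * len(seq))   # 0 = pad, 1 = real
    return torch.tensor(input_ids), torch.tensor(attention_mask)

# "I love you" (3 tokens) and "I love" (2 tokens), using the ids printed above
batch_ids, batch_mask = left_pad_batch([[40, 1842, 345], [40, 1842]], pad_token_id=50256)
print(batch_ids)
print(batch_mask)

# Every row ends with a real token, so generation continues from the prompt.
print(bool(batch_mask[:, -1].all()))  # True
```

The result can be passed on the same way as in the cell above: `model.generate(batch_ids.to(device), attention_mask=batch_mask.to(device), ...)`.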

### right padding with attention mask

In [22]:
right_padded_input_ids = torch.cat([input_ids, torch.full((1, padding_length), padding_token_id).to(device)], dim=1)
right_padded_input_ids

tensor([[   40,  1842,   345, 50256, 50256]], device='cuda:0')

In [23]:
right_attention_mask = torch.cat([torch.ones(input_ids.shape), torch.zeros((1, padding_length))], dim=1).to(device)

In [24]:
outputs_right_padding = model.generate(right_padded_input_ids, attention_mask=right_attention_mask, 
                                       max_length=right_padded_input_ids.size(1) + 5, do_sample=False)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


In [25]:
tokenizer.decode(outputs_right_padding[0], skip_special_tokens=True)

'I love youThe best thing about this'