# Transformer Architecture

<p align="center">
<img src="architecture.png" width="300">
<img src="attn.png" width="200">
<img src="mh_attn.png" width="200">
</p>

In [None]:
import pandas as pd
import numpy as np
import spacy

Inputs = "我爱你"

In [None]:
inputs = "我爱你"

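## Step 1: Input Embedding

1. Tokenization: split the text into tokens, e.g. tokens = ["我", "爱", "你"]
2. Look up each token's vector in the embedding matrix

Tokenization turns text into numbers, and the embedding maps each token to a vector of shape (1, 512). With a vocabulary of 10,000 tokens, each token can be viewed as a one-hot row of shape (1, 10000):

- inputs = "我爱你"
- "我" = [1, 0, 0, ..., 0], shape (1, 10000)
- "爱" = [0, 1, 0, ..., 0], shape (1, 10000)
- "你" = [0, 0, 1, ..., 0], shape (1, 10000)
- Embedding space = (10000, 512), i.e. d_model = 512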
![Alt text](image-8.png)
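
The one-hot view above can be sketched in NumPy: multiplying a one-hot row by the embedding matrix selects a single row of that matrix, which is why embedding layers are usually implemented as table lookups. The toy vocabulary, the sizes (5 tokens, d_model = 4) and the random matrix are illustrative assumptions, not values from a trained model.

```python
import numpy as np

# Toy vocabulary (assumed for illustration); real vocabularies are far larger.
vocab = {"我": 0, "爱": 1, "你": 2, "<pad>": 3, "<unk>": 4}
vocab_size, d_model = len(vocab), 4

rng = np.random.default_rng(0)
embedding = rng.standard_normal((vocab_size, d_model))  # (vocab_size, d_model)

tokens = ["我", "爱", "你"]
ids = np.array([vocab[t] for t in tokens])  # token ids, shape (3,)

one_hot = np.eye(vocab_size)[ids]           # (3, vocab_size)
via_matmul = one_hot @ embedding            # (3, d_model)
via_lookup = embedding[ids]                 # same rows, without the matmul

print(via_matmul.shape)
print(np.allclose(via_matmul, via_lookup))
```

The lookup form is what `nn.Embedding` does conceptually; the matrix product is only a way to reason about the shapes.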

In [None]:
nlp = spacy.load('zh_core_web_sm')
# print(nlp.pipeline)
print(len(nlp.vocab.vectors.keys()))
doc = nlp(inputs)
print(doc)

In [None]:
doc[0].vector.shape

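## Step 2: Positional Encoding

1. sin() is applied to the even dimension indices and cos() to the odd dimension indices
   ![Alt text](image-9.png)

A word can have different meanings in different sentences, and word order matters. Attention by itself is order-agnostic, so position information is added to the embeddings.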
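
A minimal NumPy sketch of the sinusoidal encoding, assuming the standard formulation with base 10000 and an even `d_model`. The function name `sinusoidal_positional_encoding` is a hypothetical helper introduced here for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix; d_model is assumed to be even."""
    position = np.arange(seq_len)[:, None]  # (seq_len, 1)
    # One frequency per (sin, cos) pair: 1 / 10000^(2i / d_model)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = np.cos(position * div_term)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=3, d_model=10)
print(pe.shape)
print(pe[0])  # position 0: sin(0) = 0 on even dims, cos(0) = 1 on odd dims
```

The resulting matrix is added element-wise to the token embeddings, so it must have the same `d_model` as the embedding.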

In [None]:
emb_dim = 10
dics = {}
for token in doc:
    dics[token.text] = token.vector[:emb_dim]
dics

In [None]:
X = pd.DataFrame(dics)
X.T

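## Step 3: Multi-Head Attention

This is the core block of the Transformer: each token builds its output as a weighted mix of all tokens in the sequence.

- Input: (n, d_model)
- Mq, Mk, Mv: (d_model, d_model)
- Q = Input @ Mq, shape (n, d_model)
- K = Input @ Mk, shape (n, d_model)
- V = Input @ Mv, shape (n, d_model)

Weight matrices: Mq, Mk and Mv are learned during training.

Multi-head: Q, K and V are each split into N heads of dimension d_k = d_model / N.

- Scale = 1 / sqrt(d_k), where d_k is the per-head dimension (d_k = d_model when there is a single head)
- A = softmax(Q @ K.T * Scale)
- Attention = A @ V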

![Alt text](image-10.png)
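
The head split described above can be sketched in NumPy. The sizes (n = 3, d_model = 8, 2 heads) and the random weights are assumptions chosen for illustration, and the final output projection used in full implementations is left out to keep the sketch short.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d_model, num_heads = 3, 8, 2
d_k = d_model // num_heads  # per-head dimension

rng = np.random.default_rng(0)
x = rng.standard_normal((n, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def split_heads(m):
    # (n, d_model) -> (num_heads, n, d_k)
    return m.reshape(n, num_heads, d_k).transpose(1, 0, 2)

Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (num_heads, n, n)
weights = softmax(scores)                         # each row sums to 1
heads = weights @ V                               # (num_heads, n, d_k)

# Concatenate the heads back into (n, d_model)
output = heads.transpose(1, 0, 2).reshape(n, d_model)
print(weights.shape, output.shape)
```

Each head attends over the full sequence but only sees its own `d_k`-wide slice of the projected features, which is why the scale uses `d_k` rather than `d_model`.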

In [None]:
d_model = 6
Wq = np.random.randn(emb_dim, d_model)
Wk = np.random.randn(emb_dim, d_model)
Wv = np.random.randn(emb_dim, d_model)
Wq

In [None]:
Q = X.T @ Wq
K = X.T @ Wk
V = X.T @ Wv

In [None]:
df_QK = Q @ K.T / np.sqrt(d_model)
df_QK

In [None]:
for i in range(len(df_QK)):
    # Row-wise softmax; subtracting the row max keeps exp() numerically stable
    row = df_QK.iloc[i]
    exp_v = np.exp(row - row.max())
    df_QK.iloc[i] = exp_v / np.sum(exp_v)

df_QK

In [None]:
V
attention = df_QK @ V
attention

## Step 4: Add & Norm
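
A minimal sketch of this step, assuming the post-norm arrangement used by the `EncoderLayer` later in this notebook: the sublayer output is added back to its input (residual connection) and the sum is layer-normalized. The learnable gain and bias of `nn.LayerNorm` are omitted, and the helper names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row (token) to zero mean and unit variance over the feature axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_output):
    # Residual connection followed by layer normalization.
    return layer_norm(x + sublayer_output)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 6))                # (n, d_model)
sublayer_output = rng.standard_normal((3, 6))  # stand-in for the attention output
y = add_and_norm(x, sublayer_output)
print(y.shape)
```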


## Step 5: Decoder

In the paper *Attention Is All You Need*, the Transformer was applied to machine translation (for example, English to French): the encoder reads the English sentence, and the decoder generates the French translation one token at a time while attending to the encoder output.

![Alt text](image-11.png)

Import the required libraries

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy

Multi-Head Attention

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, V)
        return output
        
    def split_heads(self, x):
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        
    def combine_heads(self, x):
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)
        
    def forward(self, Q, K, V, mask=None):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        output = self.W_o(self.combine_heads(attn_output))
        return output

Position-wise Feed-Forward Networks

In [None]:
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

Positional Encoding

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()
        
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer('pe', pe.unsqueeze(0))
        
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

Encoder Layer

In [None]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

Decoder Layer

In [None]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, enc_output, src_mask, tgt_mask):
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

Transformer Model

In [None]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)

        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        seq_length = tgt.size(1)
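        # Build the causal mask on the same device as tgt so the & below also works on GPU
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length, device=tgt.device), diagonal=1)).bool()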
        tgt_mask = tgt_mask & nopeak_mask
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))

        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)

        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)

        output = self.fc(dec_output)
        return output
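
To make the shapes in `generate_mask` easier to follow, here is a small standalone sketch on an assumed toy batch in which id 0 is the padding token. It uses `torch.tril`, which gives the same lower-triangular pattern as `1 - torch.triu(..., diagonal=1)`.

```python
import torch

# Assumed toy batch: id 0 is the padding token.
src = torch.tensor([[5, 7, 9, 0]])  # (batch=1, src_len=4), last position is padding
tgt = torch.tensor([[3, 4, 0]])     # (batch=1, tgt_len=3), last position is padding

src_mask = (src != 0).unsqueeze(1).unsqueeze(2)      # (1, 1, 1, 4)
tgt_pad_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)  # (1, 1, 3, 1)

seq_length = tgt.size(1)
# Lower-triangular matrix: position i may attend to positions <= i only.
causal_mask = torch.tril(torch.ones(1, seq_length, seq_length, device=tgt.device)).bool()
tgt_mask = tgt_pad_mask & causal_mask                # broadcasts to (1, 1, 3, 3)

print(src_mask.shape, tgt_mask.shape)
print(tgt_mask[0, 0].int())
```

Note that with this construction the target padding mask blanks whole query rows (the row of a padded position) rather than key columns; those rows are ignored by the loss through `ignore_index=0`.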

Preparing Sample Data

In [None]:
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 100
dropout = 0.1

transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)

# Generate random sample data
src_data = torch.randint(1, src_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)
tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)

In [3]:
import torch
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(DEVICE)

cpu


Training the Model

In [None]:
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

# Save the entire model (the `transformer` instance, not the `Transformer` class)
torch.save(transformer, "model_before_train.pt")

# Sets the transformer model to training mode, enabling behaviors like dropout that only apply during training
transformer.train()

for epoch in range(100):
    optimizer.zero_grad()
    output = transformer(src_data, tgt_data[:, :-1])
    loss = criterion(output.contiguous().view(-1, tgt_vocab_size), tgt_data[:, 1:].contiguous().view(-1))
    loss.backward()
    optimizer.step()
    print(f"Epoch: {epoch+1}, Loss: {loss.item()}")

# Save the entire model (the `transformer` instance, not the `Transformer` class)
torch.save(transformer, "model_after_train.pt")
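
As an alternative to pickling a whole model object, PyTorch also supports saving only the learned tensors through `state_dict()`. The sketch below uses a tiny `nn.Linear` as a stand-in model and an in-memory buffer instead of a file path; both are assumptions made to keep the example self-contained.

```python
import io
import torch
import torch.nn as nn

# A tiny stand-in model; any nn.Module instance is handled the same way.
model = nn.Linear(4, 2)

# Save only the learned tensors; an in-memory buffer replaces a file path here.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)

# Restore into a freshly constructed model with the same architecture.
buffer.seek(0)
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(buffer))

print(torch.equal(model.weight, restored.weight))
```

Loading a `state_dict` requires constructing the model first, which avoids depending on the exact class definition being importable at load time.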


Transformer Model Performance Evaluation

In [55]:
# Puts the transformer model in evaluation mode
transformer.eval()

# Note: the random validation set is commented out, so the training batch is reused below.
# The reported loss therefore measures fit to the training data, not generalization.
# val_src_data = torch.randint(1, src_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)
# val_tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)
val_src_data = src_data
val_tgt_data = tgt_data
with torch.no_grad():
    val_output = transformer(val_src_data, val_tgt_data[:, :-1])
    val_loss = criterion(val_output.contiguous().view(-1, tgt_vocab_size), val_tgt_data[:, 1:].contiguous().view(-1))
    print(f"Validation Loss: {val_loss.item()}")

Validation Loss: 2.0779969692230225


Model Parameters

In [None]:
# Loop through modules and print parameter details
for name, param in transformer.named_parameters():
  print(f"Layer Name: {name} Parameter Shape: {param.size()}")
  # Print specific values if needed (e.g., first few elements)
  # print(f"Parameter Values: {param[:2]}")  # Print first two elements

# output = transformer(src_data, tgt_data[:, :-1])

Real Data

In [None]:
import pandas as pd

# Load the data from the path
data_path = "datacamp_workspace_export_2022-08-08 07_56_40.csv"
# `error_bad_lines` was removed in pandas 2.0; `on_bad_lines="skip"` (pandas >= 1.3) replaces it
news_data = pd.read_csv(data_path, on_bad_lines="skip")

# Show data information
news_data.info()

https://colab.research.google.com/drive/1OJ6y1xc4HKqSw7qJdptEvkY_WJFKJLUY?usp=sharing

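## Drawbacks of the Transformer

### How to Compute the KV Cache Size

- b = batch size
- s = input sequence length
- n = output sequence length
- l = number of layers (model depth)
- h = hidden dimension

When the KV cache is stored in FP16, its peak memory footprint is b * (s + n) * h * l * 2 * 2 = 4blh(s+n) bytes.
(The first 2 accounts for storing both K and V; the second 2 is the 2 bytes per FP16 value.)

Take GPT-3 (175B) as an example and compare the KV cache with the memory taken by the model weights. The GPT-3 weights take about 350 GB in FP16, the number of layers l is 96, and the hidden dimension h is 12288.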

| Batch size | s + n | KV cache (GB) | KV cache / weights |
|:--|:--|:--|:--|
| 4 | 4096 | 75.5 | 0.22 |
| 16 | 4096 | 302 | 0.86 |
| 64 | 4096 | 1208 | 3.45 |
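
The formula can be checked with a few lines of Python. This sketch assumes GPT-3's published hidden dimension of 12288 and treats 1 GB as 10^9 bytes, so the printed values may differ from the table by a few percent depending on the unit convention. The function name `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(batch_size, total_tokens, num_layers, hidden_dim, bytes_per_value=2):
    """Peak KV cache size in bytes: 2 (K and V) * bytes_per_value * b * (s + n) * l * h."""
    return 2 * bytes_per_value * batch_size * total_tokens * num_layers * hidden_dim

# GPT-3 175B settings: 96 layers, hidden dimension 12288, FP16 weights of about 350 GB.
NUM_LAYERS, HIDDEN_DIM, WEIGHT_BYTES = 96, 12288, 350e9

for b in (4, 16, 64):
    size = kv_cache_bytes(b, 4096, NUM_LAYERS, HIDDEN_DIM)
    print(f"batch={b:>2}  KV cache={size / 1e9:7.1f} GB  ratio={size / WEIGHT_BYTES:.2f}")
```

Because the size is linear in both batch size and total tokens, doubling either one doubles the cache, which is how the cache overtakes the weights at batch size 64.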

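1. The context window of LLMs keeps growing, which creates a central tension between the demand for ever longer context windows and the limited memory of GPUs. Optimizing the KV cache is therefore essential. In the OpenAI API setting, the input rather than the output accounts for most of the cost: the input includes the prefill prompt and the conversation history, and often runs to tens of thousands of tokens. Although an input token is priced lower than an output token, avoiding recomputation of K and V lowers both the hardware requirements and the cost of inference.
2. Consumer GPUs offer good price-performance but relatively little memory, and the KV cache limits the achievable batch size to some extent, so KV cache optimization matters even more in practical deployment.
3. Text-to-video and text-to-image models such as Sora and SD3 have moved away from the U-Net architecture toward the DiT (Diffusion Transformer) architecture. For these AIGC models, the KV cache can provide a speedup similar to the one it gives LLMs.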


Reference
- <https://zhuanlan.zhihu.com/p/685853516>