# Understanding the basic components
- tokenizer
- Embedding
- Position Embedding
- Block
    - RMSNorm
    - FFN
- decoder

![img](../images/LLM-structure.png)

Next, we reimplement each component from the bottom up.

## Embedding
This layer is essentially a thin wrapper around nn.Embedding.

In [1]:
# Assume the tokenizer yields an integer tensor of token ids with shape (bs, seqlen)
import torch
from torch import nn
class Embedding(nn.Module):
    def __init__(self,vocab_size,embed_dim):
        super(Embedding,self).__init__()
        self.embedding=nn.Embedding(vocab_size,embed_dim)
    def forward(self,x):
        return self.embedding(x)

# Test (uncomment to run)
# input tensor shape: (2, 3), integer token ids in [0, vocab_size)
# output tensor shape: (2, 3, 8)
# vocab_size=64
# seqlen=3
# embed_dim=8
# bs=2
# input_ids=torch.randint(0,vocab_size,(bs,seqlen))
# embedding=Embedding(vocab_size=vocab_size,embed_dim=embed_dim)
# output=embedding(input_ids)
# print("input_ids:", input_ids)
# print("output:", output)
# print("input_ids shape:", input_ids.shape)
# print("output shape:", output.shape)
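
An embedding layer is a table lookup: each integer token id selects one row of a (vocab_size, embed_dim) table, which is why the input has shape (bs, seqlen) rather than (bs, seqlen, vocab_size). The sketch below uses plain Python lists instead of torch to make this explicit, and checks that the lookup equals the far more wasteful one-hot-times-table product. The names `one_hot`, `embed_by_lookup` and `embed_by_one_hot` are illustrative and do not come from the notebook.

```python
import random

def one_hot(index, size):
    """Return a list of `size` zeros with a 1.0 at `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]

def embed_by_lookup(table, token_ids):
    """Pick one table row per token id (what an embedding layer does)."""
    return [table[t] for t in token_ids]

def embed_by_one_hot(table, token_ids):
    """Same result via one-hot vector times table, with O(vocab_size) work per token."""
    vocab_size, embed_dim = len(table), len(table[0])
    out = []
    for t in token_ids:
        h = one_hot(t, vocab_size)
        out.append([sum(h[v] * table[v][c] for v in range(vocab_size))
                    for c in range(embed_dim)])
    return out

random.seed(0)
vocab_size, embed_dim = 6, 3
table = [[random.uniform(-1, 1) for _ in range(embed_dim)]
         for _ in range(vocab_size)]
token_ids = [4, 0, 4]  # one sequence of three token ids
looked_up = embed_by_lookup(table, token_ids)
print(len(looked_up), len(looked_up[0]))  # seqlen, embed_dim
```

Repeated ids (here the two 4s) map to the same row, so the layer itself carries no position information; that is what the position encoding in the next section adds.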


## RotaryEmbedding
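Rotary position encoding; the implementation follows the formulas and the model definition below.
### Rotary Position Embedding, RoPE

Rotary position embedding (RoPE) is a position encoding that integrates relative position information into self-attention and thereby improves the performance of the transformer architecture. Compared with absolute position encoding, RoPE extrapolates well, and it is currently the mainstream position encoding scheme.

Extrapolation refers to how a model handles inputs longer than those seen during training: if training was limited to a context length of 512, an LLM that extrapolates poorly may fail to handle longer text correctly at inference time.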

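- **Absolute position encoding**

Absolute position encoding is the scheme used by the early Transformer architecture; it maps each position to a fixed vector representation.

$$f_{t:t\in\{q,k,v\}}(\boldsymbol{x}_i,i)=\boldsymbol{W}_{t:t\in\{q,k,v\}}(\boldsymbol{x}_i+\boldsymbol{p}_i)$$

The encoding vector $\boldsymbol{p}_i$ is computed as follows:

$$\boldsymbol{p}_{i,2t}=\sin\left(i/10000^{2t/d}\right), \boldsymbol{p}_{i,2t+1}=\cos\left(i/10000^{2t/d}\right)$$

As the name suggests, absolute position encoding only considers each token's absolute position in the input sequence; relative information between tokens is not taken into account.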
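
As a minimal sketch of the sinusoidal formula, the function below computes the encoding vector for a single position in plain Python, assuming the standard base of 10000 and an even dimension `d`. The name `sinusoidal_position` is illustrative and not part of the notebook.

```python
import math

def sinusoidal_position(i, d, base=10000.0):
    """Encoding vector p_i (even dimension d) for position i."""
    p = [0.0] * d
    for t in range(d // 2):
        angle = i / base ** (2 * t / d)
        p[2 * t] = math.sin(angle)      # even slots hold the sine
        p[2 * t + 1] = math.cos(angle)  # odd slots hold the cosine
    return p

p0 = sinusoidal_position(0, 8)
p5 = sinusoidal_position(5, 8)
print(p0)
print(p5)
```

Position 0 always maps to alternating 0 and 1, and each later position gets a distinct fixed vector that is added to the token embedding before the q/k/v projections.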

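- **Rotary position encoding**

Assume the inner product of a query and a key can be expressed by a function $g$ whose inputs are the word embedding vectors $x_m, x_n$ and their relative position $m-n$:

$$\langle f_q(x_m ,m), f_k(x_n, n)\rangle=g(x_m, x_n, m-n)$$

Rotary position encoding amounts to finding a position encoding $f$ for which this equation holds.

To build intuition, we skip the detailed derivation and go straight to RoPE's conclusion: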

There exists an orthogonal matrix of the following form:

$$\boldsymbol{R}_{\Theta,m}^d=\underbrace{\begin{pmatrix}\cos m\theta_0&-\sin m\theta_0&0&0&\cdots&0&0\\\sin m\theta_0&\cos m\theta_0&0&0&\cdots&0&0\\0&0&\cos m\theta_1&-\sin m\theta_1&\cdots&0&0\\0&0&\sin m\theta_1&\cos m\theta_1&\cdots&0&0\\\vdots&\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\0&0&0&0&\cdots&\cos m\theta_{d/2-1}&-\sin m\theta_{d/2-1}\\0&0&0&0&\cdots&\sin m\theta_{d/2-1}&\cos m\theta_{d/2-1}\end{pmatrix}}_{\boldsymbol{W}_m}$$

where $\Theta=\left\{\theta_i=10000^{-2i/d},i\in[0,1,\ldots,d/2-1]\right\}$

We can rewrite the inner product of query and key in the following equivalent form in terms of the original vectors $x$:

$$
\boldsymbol{q}_m^\mathbf{T}\boldsymbol{k}_n=\left(\boldsymbol{R}_{\Theta,m}^d\boldsymbol{W}_q\boldsymbol{x}_m\right)^\mathbf{T}\left(\boldsymbol{R}_{\Theta,n}^d\boldsymbol{W}_k\boldsymbol{x}_n\right)=\boldsymbol{x}_m^\mathbf{T}\boldsymbol{W}_q^\mathbf{T}\boldsymbol{R}_{\Theta,n-m}^d\boldsymbol{W}_k\boldsymbol{x}_n
$$

where $\boldsymbol{R}_{\Theta,n-m}^d=\left(\boldsymbol{R}_{\Theta,m}^d\right)^\mathbf{T}\boldsymbol{R}_{\Theta,n}^d$.
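
The relative-position property can be checked numerically. The sketch below treats each adjacent pair of coordinates as one complex number, so that a 2D rotation by the angle $m\theta_i$ becomes a multiplication by $e^{\mathrm{i}m\theta_i}$, and compares the query-key score for two position pairs with the same offset $n-m$. It uses only the standard library; `rope` and `real_dot` are illustrative names, and the random vectors stand in for already-projected queries and keys.

```python
import cmath
import random

def rope(x, m, thetas):
    """Rotate each complex pair x[i] by the angle m * thetas[i]."""
    return [xi * cmath.exp(1j * m * th) for xi, th in zip(x, thetas)]

def real_dot(a, b):
    """Real inner product of two vectors stored as complex pairs."""
    return sum((ai * bi.conjugate()).real for ai, bi in zip(a, b))

d = 8
thetas = [10000 ** (-2 * i / d) for i in range(d // 2)]
random.seed(0)
q = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(d // 2)]
k = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(d // 2)]

score_a = real_dot(rope(q, 3, thetas), rope(k, 7, thetas))      # m=3,   n=7
score_b = real_dot(rope(q, 103, thetas), rope(k, 107, thetas))  # m=103, n=107
score_same = real_dot(rope(q, 5, thetas), rope(k, 5, thetas))   # offset 0
print(score_a, score_b)
print(score_same, real_dot(q, k))
```

Both pairs have offset 4, so the two scores agree up to floating-point error, and with offset 0 the score equals the unrotated inner product.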

Because $\boldsymbol{R}_{\Theta,m}^d$ is sparse, a direct matrix multiplication wastes compute, so the code uses the following element-wise form instead:

$$\boldsymbol{R}_{\Theta,m}^{d}\boldsymbol{x}=\begin{pmatrix}x_{0}\\x_{1}\\x_{2}\\x_{3}\\\vdots\\x_{d-2}\\x_{d-1}\end{pmatrix}\otimes\begin{pmatrix}\cos m\theta_{0}\\\cos m\theta_{0}\\\cos m\theta_{1}\\\cos m\theta_{1}\\\vdots\\\cos m\theta_{d/2-1}\\\cos m\theta_{d/2-1}\end{pmatrix}+\begin{pmatrix}-x_{1}\\x_{0}\\-x_{3}\\x_{2}\\\vdots\\-x_{d-1}\\x_{d-2}\end{pmatrix}\otimes\begin{pmatrix}\sin m\theta_{0}\\\sin m\theta_{0}\\\sin m\theta_{1}\\\sin m\theta_{1}\\\vdots\\\sin m\theta_{d/2-1}\\\sin m\theta_{d/2-1}\end{pmatrix}
$$
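
As a sanity check on the element-wise form, the sketch below builds the dense block-diagonal rotation matrix and compares its product with the element-wise computation, again in plain Python. It assumes $\theta_i = 10000^{-2i/d}$ for $i = 0, \ldots, d/2-1$; `rope_matrix` and `rope_elementwise` are illustrative names.

```python
import math

def rope_matrix(m, d, base=10000.0):
    """Dense d x d block-diagonal rotation matrix for position m."""
    R = [[0.0] * d for _ in range(d)]
    for i in range(d // 2):
        a = m * base ** (-2 * i / d)
        R[2 * i][2 * i], R[2 * i][2 * i + 1] = math.cos(a), -math.sin(a)
        R[2 * i + 1][2 * i], R[2 * i + 1][2 * i + 1] = math.sin(a), math.cos(a)
    return R

def rope_elementwise(x, m, base=10000.0):
    """Same rotation using only element-wise products, O(d) work."""
    d = len(x)
    out = [0.0] * d
    for i in range(d // 2):
        a = m * base ** (-2 * i / d)
        c, s = math.cos(a), math.sin(a)
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s      # x_{2i} cos + (-x_{2i+1}) sin
        out[2 * i + 1] = x[2 * i + 1] * c + x[2 * i] * s  # x_{2i+1} cos + x_{2i} sin
    return out

x = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 3.0, -2.0]
m = 5
R = rope_matrix(m, len(x))
dense = [sum(R[r][c] * x[c] for c in range(len(x))) for r in range(len(x))]
fast = rope_elementwise(x, m)
print(max(abs(a - b) for a, b in zip(dense, fast)))
```

The dense product costs $O(d^2)$ multiplications per vector while the element-wise form costs $O(d)$, which is why implementations avoid materializing the matrix.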

The RoPE implementation here mainly follows the RoPE implementation in LLaMA:
[LLaMA implementation](https://blog.csdn.net/m0_55846238/article/details/145728695)

In [4]:
import torch
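def precompute_pos_cis(dim:int,seqlen=2048,theta=1e5):
    # Precompute e^{i*m*theta_j} for every position m and frequency index j.
    # Note: the formula above uses base 10000; this notebook uses theta=1e5 as the base.
    # theta sequence: theta_j = theta^(-2j/dim), j = 0..dim/2-1
    freqs=1.0/(theta**(torch.arange(0,dim,2)[:dim//2].float()/dim))
    # position sequence m
    m=torch.arange(seqlen,device=freqs.device)
    # table of all m * theta_j products, shape (seqlen, dim//2)
    freqs=torch.outer(m,freqs).float()
    # polar form with magnitude 1 and angle m*theta_j, i.e. cos + i*sin
    pos_cis=torch.polar(torch.ones_like(freqs),freqs)
    # pos_cis is a complex tensor of shape (seqlen, dim//2)
    return pos_cis

# Inspect the resulting table
pos_cis=precompute_pos_cis(dim=8,seqlen=7,theta=1e5)
print("pos_cis:", pos_cis)
print("pos_cis shape:", pos_cis.shape)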


pos_cis: tensor([[ 1.0000+0.0000e+00j,  1.0000+0.0000e+00j,  1.0000+0.0000e+00j,
          1.0000+0.0000e+00j],
        [ 0.5403+8.4147e-01j,  0.9984+5.6204e-02j,  1.0000+3.1623e-03j,
          1.0000+1.7783e-04j],
        [-0.4161+9.0930e-01j,  0.9937+1.1223e-01j,  1.0000+6.3245e-03j,
          1.0000+3.5566e-04j],
        [-0.9900+1.4112e-01j,  0.9858+1.6790e-01j,  1.0000+9.4867e-03j,
          1.0000+5.3348e-04j],
        [-0.6536-7.5680e-01j,  0.9748+2.2304e-01j,  0.9999+1.2649e-02j,
          1.0000+7.1131e-04j],
        [ 0.2837-9.5892e-01j,  0.9607+2.7748e-01j,  0.9999+1.5811e-02j,
          1.0000+8.8914e-04j],
        [ 0.9602-2.7942e-01j,  0.9436+3.3104e-01j,  0.9998+1.8973e-02j,
          1.0000+1.0670e-03j]])
pos_cis shape: torch.Size([7, 4])


In [3]:
def apply_rotary_emb(xq,xk,pos_cis):
    # Apply the rotary encoding to the query and key tensors.
    # View the last dimension as head_dim//2 complex numbers (adjacent pairs).
    xq_=torch.view_as_complex(xq.float().reshape(*xq.shape[:-1],-1,2))
    print("xq_,shape",xq_.shape)
    xk_=torch.view_as_complex(xk.float().reshape(*xk.shape[:-1],-1,2))
    print("xk_,shape",xk_.shape)
    # Reshape pos_cis to (1, seqlen, 1, head_dim//2) so it broadcasts over batch and heads
    def unite_shape(pos_cis,  x):
        ndim = x.ndim
        assert 0 <= 1 < ndim
        print("xq_.shape=",xq_.shape)
        assert pos_cis.shape == (x.shape[1],  x.shape[-1]),f"pos_cis.shape:({pos_cis.shape}),(x.shape[1],  x.shape[-1])={(x.shape[1],  x.shape[-1])}"
        shape = [d if i == 1 or i == ndim - 1 else 1 for i,  d in enumerate(x.shape)]
        return pos_cis.view(*shape)
    pos_cis = unite_shape(pos_cis, xq_)
    # Rotate by complex multiplication, then flatten the (real, imag) pairs back to head_dim
    print("pos_cis shape:", pos_cis.shape)
    print("xq_ shape:", xq_.shape)
    print("xk_ shape:", xk_.shape)

    xq_out=torch.view_as_real(xq_ * pos_cis).flatten(3)
    xk_out=torch.view_as_real(xk_ * pos_cis).flatten(3)
    return xq_out, xk_out
# Test apply_rotary_emb
xq = torch.randn(2, 3, 2,8)  # (bs, seqlen, head_num, head_dim)
xk = torch.randn(2, 3, 2,8)  # (bs, seqlen, head_num, head_dim)
pos_cis = precompute_pos_cis(dim=8, seqlen=3, theta=1e5)
xq_out, xk_out = apply_rotary_emb(xq, xk, pos_cis)
print("xq shape:", xq.shape)  # expected (bs, seqlen, head_num, head_dim)
print("xk shape:", xk.shape)  # expected (bs, seqlen, head_num, head_dim)

print("xq_out_shape:", xq_out.shape)  # same shape as xq
print("xk_out_shape:", xk_out.shape)  # same shape as xk
# print("xq_out:", xq_out)
# print("xk_out:", xk_out)
print("pos_cis:", pos_cis)
print("pos_cis shape:", pos_cis.shape)


xq_,shape torch.Size([2, 3, 2, 4])
xk_,shape torch.Size([2, 3, 2, 4])
xq_.shape= torch.Size([2, 3, 2, 4])
pos_cis shape: torch.Size([1, 3, 1, 4])
xq_ shape: torch.Size([2, 3, 2, 4])
xk_ shape: torch.Size([2, 3, 2, 4])
xq shape: torch.Size([2, 3, 2, 8])
xk shape: torch.Size([2, 3, 2, 8])
xq_out_shape: torch.Size([2, 3, 2, 8])
xk_out_shape: torch.Size([2, 3, 2, 8])
pos_cis: tensor([[ 1.0000+0.0000e+00j,  1.0000+0.0000e+00j,  1.0000+0.0000e+00j,
          1.0000+0.0000e+00j],
        [ 0.5403+8.4147e-01j,  0.9984+5.6204e-02j,  1.0000+3.1623e-03j,
          1.0000+1.7783e-04j],
        [-0.4161+9.0930e-01j,  0.9937+1.1223e-01j,  1.0000+6.3245e-03j,
          1.0000+3.5566e-04j]])
pos_cis shape: torch.Size([3, 4])


## Attention Block
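MiniMind mainly uses GQA (grouped-query attention), so we reimplement GQA here. Two things need to be handled: the KV cache mechanism, and whether to use the flash-attention-backed `scaled_dot_product_attention`.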

In [4]:
def repeat_kv_heads(x,rep_num):
    # Aligns the head dimension of k/v with q: k/v are usually (bs,seqlen,kv_head_num,head_dim)
    # while q is (bs,seqlen,head_num,head_dim), so each kv head is repeated rep_num times.
    bs,seqlen,kv_head_num,head_dim=x.shape
    if rep_num == 1:
        return x
    return (
        x[:,:,:,None,:].expand(bs,seqlen,kv_head_num,rep_num,head_dim)
        .reshape(bs,seqlen,kv_head_num*rep_num,head_dim)
    )
# Test repeat_kv_heads
x = torch.randn(2, 3, 4, 8)  # (bs, seqlen, kv_head_num, head_dim)
rep_num = 3
x_repeated = repeat_kv_heads(x, rep_num)
print("x shape:", x.shape)  
print("x_repeated shape:", x_repeated.shape)  # expected (bs, seqlen, kv_head_num * rep_num, head_dim)

x shape: torch.Size([2, 3, 4, 8])
x_repeated shape: torch.Size([2, 3, 12, 8])
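
The head layout produced by the expand-then-reshape pattern can be illustrated without tensors. In the sketch below, string labels stand in for KV heads; copies of the same KV head end up adjacent, so query head `h` reads KV head `h // rep_num`. The helper `repeat_kv_labels` is an illustrative name, not part of the notebook.

```python
def repeat_kv_labels(kv_heads, rep_num):
    """Repeat each KV head rep_num times, keeping the copies adjacent."""
    return [head for head in kv_heads for _ in range(rep_num)]

head_num, kv_head_num = 8, 2
rep_num = head_num // kv_head_num
kv_heads = [f"kv{j}" for j in range(kv_head_num)]
expanded = repeat_kv_labels(kv_heads, rep_num)
for q_head, kv in enumerate(expanded):
    # Query head q_head is paired with the KV head at index q_head // rep_num.
    assert kv == kv_heads[q_head // rep_num]
print(expanded)
# → ['kv0', 'kv0', 'kv0', 'kv0', 'kv1', 'kv1', 'kv1', 'kv1']
```

Only `kv_head_num` distinct key/value projections are computed, which is where GQA saves parameters and cache memory compared with full multi-head attention.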


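A question that puzzled me for a long time: why are Query and Key position-encoded, while Value is not?
My current answer: Query and Key exist to produce the attention scores, and those scores must reflect positional relationships in order to capture how the tokens of a sentence relate to each other.<br>
Once the scores are computed, the attention weights already carry the position information, so Value needs no further position encoding.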

In [5]:
import torch
from torch import nn
import torch.nn.functional as F
import math
class GroupQueryAttention(nn.Module):
    def __init__(self,embed_dim,head_num,kv_head_num,dropout=0.1,Flash=False,max_seqlen=2048,training=True):
        super(GroupQueryAttention,self).__init__()
        self.embed_dim=embed_dim
        self.head_num=head_num
        self.kv_head_num=kv_head_num
        # Multi-head settings
        self.head_dim=embed_dim//head_num
        assert self.head_dim * self.head_num == embed_dim, "embed_dim must be divisible by head_num"
        self.rep_num=head_num//kv_head_num
        assert self.rep_num * self.kv_head_num == self.head_num, "head_num must be divisible by kv_head_num"

        # The four linear layers: q, k, v, o
        self.q_proj=nn.Linear(embed_dim,self.head_num*self.head_dim)
        self.k_proj=nn.Linear(embed_dim,self.kv_head_num*self.head_dim)
        self.v_proj=nn.Linear(embed_dim,self.kv_head_num*self.head_dim)
        self.o_proj=nn.Linear(self.head_num*self.head_dim,self.embed_dim)

        # Dropout and flash attention settings
        # Initial train/eval flag, read on the flash attention path to decide whether dropout applies.
        # nn.Module.train()/eval() overwrite this attribute later.
        self.training=training
        self.dropout=dropout
        self.attn_dropout=nn.Dropout(dropout)
        self.res_dropout=nn.Dropout(dropout)
        self.Flash=hasattr(F, 'scaled_dot_product_attention') and Flash
        # Register the causal mask.
        # The mask is added to q*k^T, whose shape is (bs,head_num,seqlen,seqlen),
        # so it is built as (1,1,max_seqlen,max_seqlen) and broadcast.
        # Entries above the diagonal (future positions) are -1e9; all others are 0.
        mask=torch.full((1,1,max_seqlen,max_seqlen),float(-1e9))
        mask=torch.triu(mask, diagonal=1)
        self.register_buffer("mask", mask, persistent=False)

    def forward(self,x,
                pos_cis=None,
                past_key_value=None,
                use_cache=False):
        # x: (bs, seqlen, embed_dim); pos_cis: rotary table rows for the positions in x
        # past_key_value: cached (key, value) tensors from earlier steps, or None
        # Returns (output, past_kv); past_kv is None unless use_cache=True
        #### Read the shape of x ####
        bs,seqlen,_=x.shape
        #### Split into heads ####
        xq=self.q_proj(x).view(bs,seqlen,self.head_num,self.head_dim)
        xk=self.k_proj(x).view(bs,seqlen,self.kv_head_num,self.head_dim)
        xk=repeat_kv_heads(xk,self.rep_num)
        xv=self.v_proj(x).view(bs,seqlen,self.kv_head_num,self.head_dim)
        xv=repeat_kv_heads(xv,self.rep_num)
        #### RoPE ####
        xq,xk=apply_rotary_emb(xq,xk,pos_cis)
        #### KV cache (inference only) ####
        if past_key_value is not None:
            # Cached tensors have shape (bs, past_len, head_num, head_dim) because heads
            # are repeated before caching; concatenate along the sequence dimension.
            xk=torch.cat([past_key_value[0], xk], dim=1)
            xv=torch.cat([past_key_value[1], xv], dim=1)
        past_kv=(xk,xv) if use_cache else None
        #### (bs, seqlen, heads, head_dim) -> (bs, heads, seqlen, head_dim) ####
        xq,xk,xv=(
            xq.transpose(1, 2),
            xk.transpose(1, 2),
            xv.transpose(1, 2)
        )
        #### Compute attention ####
        if self.Flash and seqlen!=1:
            # Use F.scaled_dot_product_attention when enabled. seqlen == 1 is excluded because
            # is_causal=True builds a lower-triangular mask that does not fit a single
            # query attending to many cached keys.
            dropout_p=self.dropout if self.training else 0.0
            output=F.scaled_dot_product_attention(
                xq,xk,xv,
                attn_mask=None,# None because is_causal=True applies the causal lower-triangular mask automatically
                dropout_p=dropout_p,
                is_causal=True
            )
        else:
            # Without flash attention, compute attention explicitly
            scores=torch.matmul(xq,xk.transpose(-2,-1))/math.sqrt(self.head_dim)
            # Rows are the last seqlen positions, columns all total_len keys (cache + current)
            total_len=xk.shape[-2]
            scores=scores+self.mask[:,:,total_len-seqlen:total_len,:total_len]
            attn_weights=F.softmax(scores,dim=-1)
            attn_weights=self.attn_dropout(attn_weights)
            output=torch.matmul(attn_weights,xv)
        #### Merge heads and project ####
        output=output.transpose(1,2).reshape(bs,seqlen,-1)
        output=self.res_dropout(self.o_proj(output))
        return output,past_kv


# Test
# batch size=2, sequence length=3, embedding dim=128 (already-embedded features, not token ids)
input_ids = torch.rand(2, 3, 128)
GQALayer = GroupQueryAttention(embed_dim=128, head_num=8, kv_head_num=4)
# Precompute pos_cis for head_dim = 128 // 8 = 16 and seqlen = 3
pos_cis = precompute_pos_cis(16,3)
output, _ = GQALayer(input_ids, pos_cis=pos_cis) 
print("output shape:", output.shape)
print("output:", output)

xq_,shape torch.Size([2, 3, 8, 8])
xk_,shape torch.Size([2, 3, 8, 8])
xq_.shape= torch.Size([2, 3, 8, 8])
pos_cis shape: torch.Size([1, 3, 1, 8])
xq_ shape: torch.Size([2, 3, 8, 8])
xk_ shape: torch.Size([2, 3, 8, 8])
output shape: torch.Size([2, 3, 128])
output: tensor([[[-1.0662e-01, -1.4046e-01, -1.6654e-01,  1.2609e-01,  3.1025e-01,
          -1.6245e-01,  1.6930e-01,  1.7007e-01,  3.0667e-01,  1.0035e-01,
          -6.2252e-02, -5.5633e-03,  3.1303e-02, -4.2615e-02,  0.0000e+00,
           2.3188e-01,  1.6678e-01,  2.0312e-01, -3.5302e-02, -0.0000e+00,
           2.9188e-02, -0.0000e+00, -8.3818e-02,  1.7378e-01,  4.8121e-02,
          -2.6933e-01,  2.0837e-01, -1.0619e-01, -4.8884e-02, -1.3271e-01,
           3.9362e-01, -9.6815e-02,  1.3962e-01,  6.8465e-02, -6.6692e-03,
          -5.6169e-02, -3.6663e-01,  1.9279e-01, -1.8567e-01, -4.4651e-01,
           0.0000e+00,  0.0000e+00, -1.2596e-01,  0.0000e+00, -3.1567e-01,
          -2.1245e-01, -4.8691e-02,  8.1511e-02, -2.5524e-01,

## RMSNorm
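### Root Mean Square Layer Normalization (RMSNorm)

RMSNorm is a modification of LayerNorm that skips the re-centering step (the mean term is removed). It can be viewed as the special case of LayerNorm when the mean is zero, using root-mean-square normalization to reduce the influence of noise.

- **Layer Norm**

$$y = \frac{x-E(x)}{\sqrt{Var(x) + \epsilon}} * \gamma + \beta$$

Assume the input tensor has shape (batch_size,  sequence_length,  embedding_dim). Layer normalization normalizes over the embedding_dim dimension. $\epsilon$ is a small hyperparameter that prevents division by zero (and the resulting overflow), and $\gamma$, $\beta$ are learnable parameters.

- **RMS Norm**

$$\bar{a}_i=\frac{a_i}{\sqrt{RMS(a)^2 + \epsilon}} * \gamma_i,  \quad where \quad RMS(a) = \sqrt{\frac{1}{n}\sum^n_{j=1}a^2_j}.$$

Assume the input tensor has shape (batch_size,  sequence_length,  embedding_dim). RMS Norm normalizes over the embedding_dim dimension. $\epsilon$ is a small hyperparameter that prevents division by zero (written inside the square root here, matching the code in the next cell), and $\gamma$ is a learnable parameter.

When the mean is zero (and $\beta = 0$), Layer Norm reduces to RMS Norm. RMS Norm drops the centering step of Layer Norm and normalizes by rescaling only, so it does not shift the data, which helps keep the activation outputs stable.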
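
The relationship between the two norms can be checked on a small vector. The sketch below is plain Python with $\gamma = 1$ and $\beta = 0$, and places $\epsilon$ inside the square root for both norms; `layer_norm` and `rms_norm` are illustrative names. On zero-mean input the two agree, while on shifted input only LayerNorm removes the offset.

```python
import math

def layer_norm(x, eps=1e-6):
    """LayerNorm with gamma = 1 and beta = 0."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """RMSNorm with gamma = 1; eps sits inside the square root."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

centered = [1.5, -0.5, -2.0, 1.0]      # sums to zero
shifted = [v + 3.0 for v in centered]  # same shape, mean 3
print(layer_norm(centered))
print(rms_norm(centered))   # matches the LayerNorm result above
print(layer_norm(shifted))  # offset removed, same as for `centered`
print(rms_norm(shifted))    # offset kept, all entries stay positive
```

RMSNorm skips the mean and variance computation and needs only one reduction (the mean of squares) per vector.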

In [6]:
class RMSNorm(nn.Module):
    def __init__(self,dim,eps=1e-9):
        super(RMSNorm,self).__init__()
        self.eps=eps
        self.weight=nn.Parameter(torch.ones(dim))
    def forward(self,x):
        return self.weight*x*torch.rsqrt(x.pow(2).mean(-1,keepdim=True)+self.eps)
# Test RMSNorm
x=torch.randn(2,2,3)
rmslayer=RMSNorm(3,1e-7)
output=rmslayer(x)
print("Input Tensor:\n", x)
print("Output Tensor:\n", output)

Input Tensor:
 tensor([[[ 1.2416, -0.9110, -2.1397],
         [-0.4401, -0.0174,  1.5917]],

        [[-0.4660,  0.8371, -1.8996],
         [ 0.3148, -0.7884,  0.8145]]])
Output Tensor:
 tensor([[[ 0.8157, -0.5985, -1.4058],
         [-0.4615, -0.0183,  1.6693]],

        [[-0.3794,  0.6815, -1.5465],
         [ 0.4635, -1.1607,  1.1991]]], grad_fn=<MulBackward0>)


## FFN

In [7]:
import torch
from torch import nn
import torch.nn.functional as F
class FeedForward(nn.Module):
    def __init__(self,embed_dim,hidden_dim,dropout=0.1):
        super(FeedForward,self).__init__()
        self.up_proj=nn.Linear(embed_dim,hidden_dim)
        self.gate_proj=nn.Linear(embed_dim,hidden_dim)
        self.down_proj=nn.Linear(hidden_dim,embed_dim)
        self.dropout=nn.Dropout(dropout)
    def forward(self,x):
        # Gated FFN (SwiGLU style, the assumed intent given the gate/up/down naming):
        # the SiLU-activated gate multiplies the up projection element-wise.
        gate=F.silu(self.gate_proj(x))
        hidden=self.dropout(gate*self.up_proj(x))
        # Project back to embed_dim
        return self.down_proj(hidden)

bs, seqlen, embed_dim = 2, 3, 4
ffn_dim = 8
x = torch.randn(bs, seqlen, embed_dim)
ffn_layer = FeedForward(embed_dim, ffn_dim)
output = ffn_layer(x)
print("Input Tensor:\n", x)
print("Output Tensor:\n", output)
# Check the output shape of FeedForward
print("Output Tensor Shape:", output.shape)



Input Tensor:
 tensor([[[ 1.5002,  2.4258, -0.3180,  0.5998],
         [ 2.0557, -0.1931, -1.3039, -0.5660],
         [-0.3713,  0.5173, -0.1114, -1.2280]],

        [[ 0.3649,  0.3621, -0.5287,  1.4578],
         [-0.4595,  0.4389,  0.2427, -0.3213],
         [-1.0624, -0.6268, -0.3982,  1.1072]]])
Output Tensor:
 tensor([[[ 1.0263, -0.5621, -0.5764,  0.2843],
         [ 0.3159,  0.0451, -1.1959, -0.2367],
         [ 0.5443,  0.1495, -0.3052, -0.4412]],

        [[ 0.6623, -0.6118, -0.5171,  0.2873],
         [ 0.7190, -0.2367, -0.1903, -0.1967],
         [ 0.5744, -0.4669,  0.0507,  0.2224]]], grad_fn=<ViewBackward0>)
Output Tensor Shape: torch.Size([2, 3, 4])


# Assembling MinimindBlock

In [8]:
class MinimindBlock(nn.Module):
    def __init__(self,layer_id,seqlen,embed_dim,head_num,kv_head_num,hidden_dim):
        super(MinimindBlock,self).__init__()
        self.layer_id=layer_id
        self.seqlen=seqlen
        self.embed_dim=embed_dim
        self.head_num=head_num
        self.kv_head_num=kv_head_num
        self.hidden_dim=hidden_dim
        self.head_dim=self.embed_dim//self.head_num
        
        assert self.embed_dim==self.head_num*self.head_dim,"embed_dim must be divisible by head_num"
        #### Sub-module initialization
        self.norm1=RMSNorm(self.embed_dim)
        self.norm2=RMSNorm(self.embed_dim)
        self.attn1=GroupQueryAttention(self.embed_dim,self.head_num,self.kv_head_num)
        self.ffn=FeedForward(self.embed_dim,self.hidden_dim)
        #### Extra initialization: precompute the rotary table for up to seqlen positions
        print("head_dim=",self.head_dim)
        pos_cis=precompute_pos_cis(self.head_dim,self.seqlen)
        self.register_buffer("pos_cis",pos_cis)
        print("pos_cis.shape",pos_cis.shape)

    def forward(self,x):
        # Pre-norm residual block: h = x + Attn(Norm(x)), then h + FFN(Norm(h))
        pos_cis=self.pos_cis[:x.shape[1]]  # rows for the positions actually present
        attn_out,_=self.attn1(self.norm1(x),pos_cis)
        h=x+attn_out
        return h+self.ffn(self.norm2(h))

bs,seqlen,embed_dim=2,64,128
head_num=16
kv_head_num=4
hidden_dim=1024
input_ids=torch.randn(bs,seqlen,embed_dim)
print("input_ids.shape=",input_ids.shape)
block1=MinimindBlock(1,seqlen,embed_dim,head_num,kv_head_num,hidden_dim)
res=block1(input_ids)
print(res.shape)

        

input_ids.shape= torch.Size([2, 64, 128])
head_dim= 8
pos_cis.shape torch.Size([64, 4])
xq_,shape torch.Size([2, 64, 16, 4])
xk_,shape torch.Size([2, 64, 16, 4])
xq_.shape= torch.Size([2, 64, 16, 4])
pos_cis shape: torch.Size([1, 64, 1, 4])
xq_ shape: torch.Size([2, 64, 16, 4])
xk_ shape: torch.Size([2, 64, 16, 4])
torch.Size([2, 64, 128])


# MiniMindLM Dense
Recap of the structure:<br>
![img](../images/LLM-structure.png)

In [16]:
class MinimindLM(nn.Module):
    def __init__(self,blocknum,vocab_size,seqlen,embed_dim,head_num,kv_head_num,hidden_dim):
        super(MinimindLM,self).__init__()
        #### Store hyperparameters ####
        self.blocknum=blocknum
        self.vocab_size=vocab_size
        self.seqlen=seqlen
        self.embed_dim=embed_dim
        self.head_num=head_num
        self.kv_head_num=kv_head_num
        self.hidden_dim=hidden_dim
        self.params=(seqlen,embed_dim,head_num,kv_head_num,hidden_dim)
        #### Define the components ####
        self.embed=nn.Embedding(self.vocab_size,self.embed_dim)
        self.blocks=nn.ModuleList([MinimindBlock(i,*(self.params)) for i in range(self.blocknum) ])
        self.norm=RMSNorm(self.embed_dim)
        self.linear=nn.Linear(self.embed_dim,self.vocab_size)
        
    def forward(self,x):
        # x: integer token ids of shape (bs, seqlen)
        x=self.embed(x)
        for block in self.blocks:
            x = block(x)
        x=self.norm(x)
        x=self.linear(x)
        # Returns per-token probabilities over the vocabulary.
        # For training with F.cross_entropy, return the logits (skip this softmax) instead.
        x=F.softmax(x,dim=-1)
        return x

blocknum=2
vocab_size=1024
seqlen=32
embed_dim=64
head_num=16
kv_head_num=4
hidden_dim=256
params=(blocknum,vocab_size,seqlen,embed_dim,head_num,kv_head_num,hidden_dim)
input_ids=torch.randint(0,1024,(2,32))
minimindLM=MinimindLM(*params)
res=minimindLM(input_ids)
print("input_ids.shape=",input_ids.shape)
print("input_ids=",input_ids)
print("res.shape=",res.shape)
print("res=",res)


head_dim= 4
pos_cis.shape torch.Size([32, 2])
head_dim= 4
pos_cis.shape torch.Size([32, 2])
xq_,shape torch.Size([2, 32, 16, 2])
xk_,shape torch.Size([2, 32, 16, 2])
xq_.shape= torch.Size([2, 32, 16, 2])
pos_cis shape: torch.Size([1, 32, 1, 2])
xq_ shape: torch.Size([2, 32, 16, 2])
xk_ shape: torch.Size([2, 32, 16, 2])
xq_,shape torch.Size([2, 32, 16, 2])
xk_,shape torch.Size([2, 32, 16, 2])
xq_.shape= torch.Size([2, 32, 16, 2])
pos_cis shape: torch.Size([1, 32, 1, 2])
xq_ shape: torch.Size([2, 32, 16, 2])
xk_ shape: torch.Size([2, 32, 16, 2])
input_ids.shape= torch.Size([2, 32])
input_ids= tensor([[ 512,  565,  963,  532,  573,  872,  708,  606,  501,  928,  761,  671,
          713,  422,  354,  466,  335,  128,  363,  451,  600,  882,    5,  744,
          532,  139,  112,  937,  444,    3,  896,  258],
        [ 961,  552,  368,  464,  521,  134,  546,  799,  630,  232,  176,  152,
          730,  389,  392,   78,  102,  338,  985,  479,  832,  679,  847,  936,
          245,  333,

## Open questions
- How to handle input_ids that are too long or too short
- How exactly padding and masks work
- How to integrate the tokenizer
- How to make use of the transformers library

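# Wrapping this bare-bones model with the transformers library
[Notes on the transformers library](https://cloud.tencent.com/developer/article/2367010)

The main difference between generate and forward: generate is used only for inference, whereas forward can be used for both training and inference.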
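
The relationship between the two can be sketched as a loop: generation repeatedly calls forward on the tokens produced so far and appends one new token each step. The code below is a toy illustration in plain Python, not the transformers API. `toy_forward` is a hypothetical stand-in for a model's forward pass, and `greedy_generate` is a hypothetical helper that always picks the highest-scoring token; real generation methods typically also support sampling, KV caching and batching.

```python
def toy_forward(token_ids, vocab_size=5):
    """Hypothetical stand-in for model.forward: one list of logits per position."""
    logits = []
    for t in token_ids:
        row = [0.0] * vocab_size
        row[(t + 1) % vocab_size] = 1.0  # toy rule: always prefer the next token id
        logits.append(row)
    return logits

def greedy_generate(forward, prompt_ids, max_new_tokens, eos_id=None):
    """Inference-only loop: call forward, keep the argmax at the last position."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        last = forward(ids)[-1]
        next_id = max(range(len(last)), key=last.__getitem__)
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break  # stop early once the end-of-sequence token appears
    return ids

print(greedy_generate(toy_forward, [0, 1], max_new_tokens=4))
# → [0, 1, 2, 3, 4, 0]
```

Training uses forward directly, because the loss needs the outputs at every position in one pass, whereas generation only ever consumes the last position of each call.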