# Building our model
![MiniMind structure](../images/LLM-structure.png)

## Reference design
The LLM structure follows llama3 and qwen.<br>
We reproduce it bottom-up, one layer at a time.

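### 0-Tokenizer
The tokenizer sits in front of the model, so we skip it for now and assume the input text has already been converted into token ids (`input_ids`).<br>
We assume the tokenizer's vocabulary size is 6400, the same as the tokenizer we trained.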

### 1-Embedding
The input embedding layer maps each entry of the raw `input_ids` to a dense vector.<br>
About embeddings: [What is an embedding layer](https://zhuanlan.zhihu.com/p/164502624), [The past and present of embeddings](https://zhuanlan.zhihu.com/p/1916927561000255869)
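
As a minimal sketch of what the lookup does, the plain-Python example below treats the embedding as a table indexed by token id, and shows that this equals multiplying a one-hot vector by the table. The table values and the helper names `embed` and `one_hot_matmul` are made up for illustration; they are not part of the model code.

```python
# Toy embedding table: vocabulary of 4 tokens, embedding dimension 3 (made-up values).
table = [
    [0.1, 0.2, 0.3],   # token id 0
    [1.0, 1.1, 1.2],   # token id 1
    [2.0, 2.1, 2.2],   # token id 2
    [3.0, 3.1, 3.2],   # token id 3
]

def embed(input_ids):
    """Map each token id to its row of the table."""
    return [[table[t] for t in sentence] for sentence in input_ids]

def one_hot_matmul(token_id):
    """The same lookup written as one_hot(token_id) @ table."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(table))]
    dim = len(table[0])
    return [sum(one_hot[i] * table[i][j] for i in range(len(table))) for j in range(dim)]

batch = [[2, 0], [3, 1]]   # 2 sentences, 2 tokens each
vectors = embed(batch)     # nested lists with shape (2, 2, 3)
print(vectors[0][0])       # row of token id 2
```

`nn.Embedding` stores the same kind of table as a trainable weight matrix, which is why the output shape is the input shape with one extra dimension of size `embed_dim`.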

In [1]:
import torch
from torch import nn
input_ids=torch.randint(0,6400,(2,10))  # assume 2 sentences of 10 tokens each
print(input_ids)  # a (2, 10) tensor of token ids

class Embedding(nn.Module):
    def __init__(self,vocab_size,embed_dim):
        super(Embedding,self).__init__()
        self.embedding = nn.Embedding(vocab_size,embed_dim)
    def forward(self,input_ids):
        return self.embedding(input_ids)
# Example usage
vocab_size = 6400  # vocabulary size
embed_dim = 512  # embedding dimension
embedding_layer = Embedding(vocab_size, embed_dim)
embedded_input = embedding_layer(input_ids)
print(embedded_input.shape)  # expected shape: (2, 10, 512)
print(embedded_input)  # the embedded vectors


tensor([[2030, 6373, 3852, 6064, 3649, 4024, 6350, 4501, 5165, 4303],
        [2358, 2897, 5469, 1784, 4995, 4043,  932, 6109,  556, 1158]])
torch.Size([2, 10, 512])
tensor([[[ 0.2758,  0.4148,  1.1885,  ..., -0.6679,  0.1490,  0.5232],
         [-2.6458, -0.8047, -0.6296,  ..., -0.6315,  0.1417,  0.6417],
         [ 1.3939,  1.1044, -0.7613,  ..., -1.7296, -0.4728, -0.7427],
         ...,
         [ 2.5562, -0.4796,  1.1946,  ..., -0.6231,  1.1949,  1.8410],
         [ 0.2589,  0.8934, -0.5616,  ...,  0.6162, -0.1266,  2.2751],
         [-0.7783,  0.4408,  0.5889,  ...,  0.0514, -1.5111,  1.1941]],

        [[-1.1211,  0.5346, -0.3174,  ..., -1.5841, -1.0099, -0.6460],
         [ 0.4476, -0.7232,  0.0389,  ...,  4.0461, -1.1284,  0.4496],
         [-1.0826, -2.5474,  0.2178,  ..., -1.2754, -0.4207,  0.0776],
         ...,
         [-0.4938, -1.9396, -0.2887,  ..., -2.2465, -0.6882,  1.4206],
         [-0.6993, -1.3383, -0.6565,  ..., -0.1087,  1.3399,  0.8225],
         [-0.3235, -1.1

### 2-MiniMind-Block
The MiniMind block is a transformer block.<br>
Its job is to learn and extract useful features from the embedded input.<br>
The main components of the MiniMind block are
- RMSNorm
- GQA (attention)
- RoPE (position encoding)
- FFN

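#### 2.1 RMSNorm
RMSNorm is the normalization layer used in LLaMA models: [Introduction to BatchNorm, LayerNorm and RMSNorm](https://blog.csdn.net/wxc971231/article/details/139925707)<br>
[Su Jianlin: where to place the norm, pre or post?](https://kexue.fm/archives/9009)
- **LayerNorm**
    The main formula is:
    $$\frac{x-E(x)}{\sqrt{Var(x)+\epsilon}} * \gamma + \beta$$
    It needs both $E(x)$ and $Var(x)$, so it costs more to compute.
- **RMSNorm**
    The main formula is:
    $$\bar{a}_i=\frac{a_i}{\sqrt{\frac{1}{n}\sum^n_{j=1}a^2_j+\epsilon}} * \gamma$$
    The main advantage of RMSNorm is lower computation: it drops the mean subtraction and the bias term while still keeping the activations at a stable scale.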

In [5]:
class RMSNorm(nn.Module):
    def __init__(self,embed_dim,eps=1e-6):
        super(RMSNorm,self).__init__()
        self.embed_dim=embed_dim
        self.eps=eps
        self.weight=nn.Parameter(torch.ones(embed_dim))
    def forward(self,X):
        """
        X= (batch_size, seq_len, embed_dim)
        """
        return X*self.weight*torch.rsqrt(torch.mean(X.pow(2),dim=-1,keepdim=True)+self.eps)
# Example usage
embed_dim=2
X= torch.randn(2, 2, embed_dim)  # 2 sentences, 2 tokens each, embedding dimension 2
rmsnorm_layer = RMSNorm(embed_dim)
normalized_output = rmsnorm_layer(X)
print(X)
print(normalized_output)

tensor([[[-1.3649, -1.8068],
         [-0.5323, -0.3894]],

        [[ 0.8681,  0.3576],
         [ 1.2855,  0.3212]]])
tensor([[[-0.8524, -1.1284],
         [-1.1413, -0.8351]],

        [[ 1.3076,  0.5386],
         [ 1.3720,  0.3428]]], grad_fn=<MulBackward0>)
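
To check the printed numbers by hand, here is a small plain-Python sketch of the same computation on a single vector, with the weight left at its initial value of 1. The helper names `rms_norm` and `layer_norm` are illustrative only; `layer_norm` omits the learnable scale and shift and is included purely for comparison.

```python
import math

def rms_norm(x, weight=None, eps=1e-6):
    """RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * weight."""
    if weight is None:
        weight = [1.0] * len(x)
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def layer_norm(x, eps=1e-6):
    """LayerNorm without the learnable scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

row = [-1.3649, -1.8068]             # first token of the printed X above
out = rms_norm(row)
print([round(v, 4) for v in out])    # [-0.8524, -1.1284]
```

RMSNorm only rescales the vector, so the signs and the ratio between the entries are preserved, whereas LayerNorm also recenters it around zero.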


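#### 2.2 RoPE rotary encoding
##### Rotary Position Embedding, RoPE

Rotary position embedding is a position encoding that injects relative position information into self-attention and thereby improves the transformer architecture. Compared with absolute position encoding, RoPE extrapolates well, and it is the mainstream position encoding today.

Extrapolation, put simply: if training caps the context length at 512, then at inference time an LLM that extrapolates poorly may fail to handle text longer than that.

- **Absolute position encoding**

Absolute position encoding is the scheme used by the early Transformer architecture; it maps every position to a fixed vector.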

$$f_{t:t\in\{q,k,v\}}(\boldsymbol{x}_i,i)=\boldsymbol{W}_{t:t\in\{q,k,v\}}(\boldsymbol{x}_i+\boldsymbol{p}_i)$$

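where the encoding vector $p_i$ is computed as:

$$\boldsymbol{p}_{i,2t}=\sin\left(i/10000^{2t/d}\right), \boldsymbol{p}_{i,2t+1}=\cos\left(i/10000^{2t/d}\right)$$

As the name suggests, absolute position encoding only captures each token's absolute position in the input sequence; relative information between tokens is not taken into account.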
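
A minimal plain-Python sketch of the sinusoidal encoding, assuming the usual base of 10000 from the original Transformer. The function name `sinusoidal_position` is illustrative.

```python
import math

def sinusoidal_position(i, d, base=10000.0):
    """Return the d-dimensional sinusoidal encoding p_i for position i."""
    p = [0.0] * d
    for t in range(d // 2):
        angle = i / base ** (2 * t / d)   # ** binds tighter than /, so this is i / (base ** ...)
        p[2 * t] = math.sin(angle)        # even index: sine
        p[2 * t + 1] = math.cos(angle)    # odd index: cosine
    return p

p0 = sinusoidal_position(0, 4)   # [0.0, 1.0, 0.0, 1.0]
p1 = sinusoidal_position(1, 4)
print(p0)
print([round(v, 4) for v in p1])
```

Each position gets its own fixed vector that is added to the token embedding, so the attention score has no built-in dependence on the distance between two tokens.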

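- **Rotary position encoding**

Suppose the inner product of query and key can be expressed by a function $g$ whose inputs are the word embeddings $x_m, x_n$ and their relative position $m-n$:

$$<f_q(x_m ,m), f_k(x_n, n)>=g(x_m, x_n, m-n)$$

Rotary position encoding is a position encoding that makes this equation hold.

To build intuition we skip the full derivation and go straight to RoPE's conclusion: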

There exists an orthogonal matrix:

$$\boldsymbol{R}_{\Theta,m}^d=\underbrace{\begin{pmatrix}\cos m\theta_0&-\sin m\theta_0&0&0&\cdots&0&0\\\sin m\theta_0&\cos m\theta_0&0&0&\cdots&0&0\\0&0&\cos m\theta_1&-\sin m\theta_1&\cdots&0&0\\0&0&\sin m\theta_1&\cos m\theta_1&\cdots&0&0\\\vdots&\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\0&0&0&0&\cdots&\cos m\theta_{d/2-1}&-\sin m\theta_{d/2-1}\\0&0&0&0&\cdots&\sin m\theta_{d/2-1}&\cos m\theta_{d/2-1}\end{pmatrix}}_{\boldsymbol{W}_m}$$

where $\Theta=\left\{\theta_i=10000^{-2i/d},i\in[0,1,\ldots,d/2-1]\right\}$

The query-key inner product can then be rewritten in the following equivalent form in terms of the original vectors $x$:

$$
\boldsymbol{q}_m^\mathbf{T}\boldsymbol{k}_n=\left(\boldsymbol{R}_{\Theta,m}^d\boldsymbol{W}_q\boldsymbol{x}_m\right)^\mathbf{T}\left(\boldsymbol{R}_{\Theta,n}^d\boldsymbol{W}_k\boldsymbol{x}_n\right)=\boldsymbol{x}_m^\mathbf{T}\boldsymbol{W}_q^\mathbf{T}\boldsymbol{R}_{\Theta,n-m}^d\boldsymbol{W}_k\boldsymbol{x}_n
$$

where $\boldsymbol{R}_{\Theta,n-m}^d=\left(\boldsymbol{R}_{\Theta,m}^d\right)^\mathbf{T}\boldsymbol{R}_{\Theta,n}^d$.
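
The relative-position property can be checked numerically. The sketch below uses a single 2-D pair stored as one complex number, so rotating by position $m$ is a multiplication by $e^{im\theta}$. The vectors, the angle `theta=0.5` and the helper names are made-up values for illustration.

```python
import cmath

def rotate(vec, pos, theta=0.5):
    """Rotate a 2-D vector, stored as one complex number, by the angle pos * theta."""
    return vec * cmath.exp(1j * pos * theta)

def dot(a, b):
    """Real inner product of two 2-D vectors stored as complex numbers."""
    return (a.conjugate() * b).real

q = complex(1.0, 2.0)    # stands in for W_q x_m
k = complex(0.5, -1.0)   # stands in for W_k x_n

score_a = dot(rotate(q, 3), rotate(k, 7))     # m = 3,  n = 7
score_b = dot(rotate(q, 10), rotate(k, 14))   # m = 10, n = 14, same n - m
score_c = dot(rotate(q, 3), rotate(k, 8))     # different n - m
print(score_a, score_b, score_c)
```

`score_a` and `score_b` agree up to floating-point error because only $n-m$ enters the product, while `score_c` differs.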

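Because $\boldsymbol{R}_{\Theta,m}^d$ is sparse, a direct matrix multiplication would waste compute, so the code uses the following element-wise form instead (here $\otimes$ is the element-wise product):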

$$\boldsymbol{R}_{\Theta,m}^{d}\boldsymbol{x}=\begin{pmatrix}x_{0}\\x_{1}\\x_{2}\\x_{3}\\\vdots\\x_{d-2}\\x_{d-1}\end{pmatrix}\otimes\begin{pmatrix}\cos m\theta_{0}\\\cos m\theta_{0}\\\cos m\theta_{1}\\\cos m\theta_{1}\\\vdots\\\cos m\theta_{d/2-1}\\\cos m\theta_{d/2-1}\end{pmatrix}+\begin{pmatrix}-x_{1}\\x_{0}\\-x_{3}\\x_{2}\\\vdots\\-x_{d-1}\\x_{d-2}\end{pmatrix}\otimes\begin{pmatrix}\sin m\theta_{0}\\\sin m\theta_{0}\\\sin m\theta_{1}\\\sin m\theta_{1}\\\vdots\\\sin m\theta_{d/2-1}\\\sin m\theta_{d/2-1}\end{pmatrix}
$$
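
As a sanity check of this identity, the plain-Python sketch below applies the rotation once block by block (one 2x2 rotation per pair) and once in the element-wise form, and compares the results. The function names and the test vector are illustrative assumptions.

```python
import math

def rope_blockwise(x, m, thetas):
    """Apply the block-diagonal rotation R_m to x, one 2x2 block at a time."""
    out = []
    for i, theta in enumerate(thetas):
        c, s = math.cos(m * theta), math.sin(m * theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out += [c * x0 - s * x1, s * x0 + c * x1]
    return out

def rope_elementwise(x, m, thetas):
    """Apply the same rotation as x * cos + rotate_pairs(x) * sin."""
    cos = [math.cos(m * t) for t in thetas for _ in range(2)]
    sin = [math.sin(m * t) for t in thetas for _ in range(2)]
    # rotate_pairs maps (x0, x1, x2, x3, ...) to (-x1, x0, -x3, x2, ...)
    rotated = []
    for i in range(0, len(x), 2):
        rotated += [-x[i + 1], x[i]]
    return [a * c + b * s for a, b, c, s in zip(x, rotated, cos, sin)]

d = 4
thetas = [10000 ** (-2 * i / d) for i in range(d // 2)]   # theta_0 = 1, theta_1 = 0.01
x = [1.0, 2.0, 3.0, 4.0]
a = rope_blockwise(x, m=5, thetas=thetas)
b = rope_elementwise(x, m=5, thetas=thetas)
print(a)
print(b)
```

Both forms give the same vector, and because the matrix is orthogonal the squared norm of `x` (here 30) is unchanged by the rotation.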

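In short, RoPE uses the form of an absolute encoding to express relative positions, so it combines the simplicity of absolute encoding with the position generalization of relative encoding.<br>
The RoPE implementation here mainly follows LLaMA's:
[LLaMA implementation](https://blog.csdn.net/m0_55846238/article/details/145728695)<br>
If rotary encoding is hard to follow, see [Understanding RoPE painlessly](https://zhuanlan.zhihu.com/p/8306958113)

More specifically, there are two common implementations of rotary encoding, the Qwen one and the LLaMA one.<br>
We first work through the LLaMA one, together with the [LLaMA implementation](https://blog.csdn.net/m0_55846238/article/details/145728695)<br>
The LLaMA implementation is based on

$$
\begin{aligned}
f_q(\boldsymbol{x}_m, m) &= (\boldsymbol{W}_q \boldsymbol{x}_m) e^{im\theta} \\
f_k(\boldsymbol{x}_n, n) &= (\boldsymbol{W}_k \boldsymbol{x}_n) e^{in\theta} \\
g(\boldsymbol{x}_m, \boldsymbol{x}_n, m - n) &= \text{Re}\left[ (\boldsymbol{W}_q \boldsymbol{x}_m)^* (\boldsymbol{W}_k \boldsymbol{x}_n) e^{i(n - m)\theta} \right]
\end{aligned}
$$
The main flow is: compute all $\theta_i$, build the table of $e^{im\theta_i}=\cos m\theta_i+i\sin m\theta_i$ for every position $m$, then multiply q and k by it to inject the position information (v is not rotated).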

In [15]:
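def precompute_pos_cis(embed_dim,seqlen,theta=1e4):
    """
    Precompute the rotation factors e^{i*m*theta_j} for every position m.
    embed_dim: dimension being rotated (the per-head dimension when used in attention)
    seqlen: sequence length
    theta: base of the frequencies, default 1e4 to match the formula above
    Returns a complex tensor of shape (seqlen, embed_dim//2)
    """
    # theta_j = theta^(-2j/embed_dim); the division by embed_dim belongs inside the exponent
    freqs = 1.0/(theta**(torch.arange(0,embed_dim,2)[:embed_dim//2].float()/embed_dim))
    print(freqs.shape)
    m=torch.arange(seqlen,device=freqs.device)
    print(m.shape)
    freqs=torch.outer(m,freqs).float() # m * theta_j for every position m
    pos_cis=torch.polar(torch.ones_like(freqs),freqs) # polar builds complex numbers with modulus 1 and angle freqs
    print(pos_cis.shape)
    return pos_cis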

def apply_rotary(xq,xk,pos_cis):
    """
    xq: (batch_size, seq_len, head_num, head_dim)
    xk: (batch_size, seq_len, head_num, head_dim)
    pos_cis: (seq_len, head_dim//2)
    """
    ## pos_cis is usually precomputed for more positions than xq/xk contain, so it may need slicing to seq_len first
    xq_=torch.view_as_complex(xq.float().reshape(*xq.shape[:-1],-1,2))
    xk_=torch.view_as_complex(xk.float().reshape(*xk.shape[:-1],-1,2))
    def unite_shape(pos_cis,x):
        ndim=x.ndim
        assert ndim>1
        # x is the complex view with shape [bs,seqlen,n_heads,head_dim//2]
        assert pos_cis.shape==(x.shape[1],x.shape[-1])
        shape = [d if i == 1 or i == ndim - 1 else 1 for i,  d in enumerate(x.shape)]
        return pos_cis.view(*shape)
    pos_cis=unite_shape(pos_cis,xq_)
    xq_out= torch.view_as_real(xq_ * pos_cis).flatten(3)
    xk_out= torch.view_as_real(xk_ * pos_cis).flatten(3)
    return xq_out,xk_out

# Example usage
inputs=torch.randn(2,3,4,2) # batch of 2, sequence length 3, 4 heads, head_dim 2
pos_cis=precompute_pos_cis(2,3)
xq,xk=apply_rotary(inputs,inputs,pos_cis)
print(inputs.shape)  # expected shape: (2, 3, 4, 2)
print(xq.shape)

torch.Size([1])
torch.Size([3])
torch.Size([3, 1])
torch.Size([2, 3, 4, 2])
torch.Size([2, 3, 4, 2])


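#### 2.3 GQA&MHA
Now we reach the familiar attention part.<br>
GQA is MHA in which several query heads share one key/value head.<br>
Recall what the block looks like ![llm](../images/LLM-structure.png)

First comes the helper function that GQA needs, which aligns the head dimension of K/V with Q.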

In [17]:
def repeat_kv(x,rep_num):
    """
    Repeat each key/value head in x rep_num times so that it lines up with the query heads
    """
    if rep_num == 1:
        return x
    bs,seqlen,head_num,head_dim= x.shape
    return x[:,:,:,None,:].expand(bs,seqlen,head_num,rep_num,head_dim).reshape(bs,seqlen,head_num*rep_num,head_dim)

# Example usage
X=torch.randn(2,3,4,2) # batch of 2, sequence length 3, 4 KV heads, head_dim 2
rep_num=2
repeated_X=repeat_kv(X,rep_num)
print(repeated_X.shape)  # expected shape: (2, 3, 8, 2)


torch.Size([2, 3, 8, 2])
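
The layout produced by the expand-and-reshape in `repeat_kv` puts the copies of each KV head next to each other, so query head `j` reads from KV head `j // rep_num`. The plain-Python sketch below illustrates that mapping; the helper names `kv_head_for_query_head` and `repeat_heads` are hypothetical and exist only for this illustration.

```python
def kv_head_for_query_head(q_head, head_num, kv_head_num):
    """Index of the key/value head that a given query head reads from."""
    assert head_num % kv_head_num == 0
    rep_num = head_num // kv_head_num
    return q_head // rep_num

def repeat_heads(kv_heads, rep_num):
    """List version of repeat_kv: each KV head appears rep_num times in a row."""
    return [h for h in kv_heads for _ in range(rep_num)]

mapping = [kv_head_for_query_head(q, head_num=8, kv_head_num=4) for q in range(8)]
print(mapping)                           # [0, 0, 1, 1, 2, 2, 3, 3]
print(repeat_heads(["kv0", "kv1"], 2))   # ['kv0', 'kv0', 'kv1', 'kv1']
```

With `kv_head_num == head_num` every query head has its own KV head (plain MHA), and with `kv_head_num == 1` all query heads share one.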


In [30]:
import torch.nn.functional as F
import math
class GroupQueryAttention(nn.Module):
    def __init__(self,embed_dim,head_num,kv_head_num,dropout=0.1,Flash=False,max_seqlen=1024):
        super(GroupQueryAttention,self).__init__()
        ## attributes
        self.embed_dim=embed_dim
        self.head_num=head_num
        self.kv_head_num=kv_head_num
        self.head_dim=embed_dim//head_num
        self.kv_head_dim=self.head_dim
        self.rep_num=self.head_num//self.kv_head_num
        assert self.rep_num * self.kv_head_num == self.head_num, "head_num must be divisible by kv_head_num"
        self.dropout=dropout
        self.Flash=hasattr(F,"scaled_dot_product_attention") and Flash
        assert embed_dim == head_num * self.head_dim, "embed_dim must be divisible by head_num"
        self.scale= math.sqrt(self.head_dim)
        ## layers
        self.q_proj=nn.Linear(embed_dim,self.head_dim*self.head_num)
        self.k_proj=nn.Linear(embed_dim,self.kv_head_num*self.head_dim)
        self.v_proj=nn.Linear(embed_dim,self.kv_head_num*self.head_dim)
        self.o_proj=nn.Linear(self.head_num*self.head_dim,self.embed_dim)
        self.attn_dropout=nn.Dropout(dropout)
        self.res_dropout=nn.Dropout(dropout)
        ## non-trainable buffers
        ## causal mask, broadcast over (bs, head_num): 0 on and below the diagonal, -1e9 above it
        mask=torch.full((1,1,max_seqlen,max_seqlen),-1e9)
        mask=torch.triu(mask,diagonal=1)
        self.register_buffer("mask",mask)
    
    def forward(self,X,
                pos_cis=None,
                use_cache=False,
                past_key_value=None,
                ):
        bs,seqlen,embed_dim=X.shape
        # number of tokens already held in the KV cache
        past_len=0 if past_key_value is None else past_key_value[0].shape[1]
        xq=self.q_proj(X).view(bs,seqlen,self.head_num,self.head_dim)
        xk=self.k_proj(X).view(bs,seqlen,self.kv_head_num,self.head_dim)
        xv=self.v_proj(X).view(bs,seqlen,self.kv_head_num,self.head_dim)
        xk=repeat_kv(xk,self.rep_num)
        xv=repeat_kv(xv,self.rep_num)

        if pos_cis is None:
            # positions of the new tokens start right after the cached ones
            pos_cis=precompute_pos_cis(self.head_dim,past_len+seqlen)[past_len:].to(X.device)
        xq,xk=apply_rotary(xq,xk,pos_cis)
        if  past_key_value is not None:
            xk=torch.cat([past_key_value[0],xk],dim=1)
            xv=torch.cat([past_key_value[1],xv],dim=1)
        past_key_value=(xk,xv) if use_cache else None

        xq,xk,xv= xq.transpose(1,2),xk.transpose(1,2),xv.transpose(1,2)
        # rows are the new queries, columns are all keys seen so far
        mask=self.mask[:,:,past_len:past_len+seqlen,:past_len+seqlen]

        if self.Flash:
            # scaled_dot_product_attention returns only the attention output
            attn_output= F.scaled_dot_product_attention(
                xq,xk,xv,attn_mask=mask,dropout_p=self.dropout if self.training else 0.0
            )
        else:
            attn_weights=torch.matmul(xq,xk.transpose(-2,-1))/self.scale
            attn_weights=attn_weights+mask
            attn_weights=self.attn_dropout(F.softmax(attn_weights,dim=-1))
            attn_output=torch.matmul(attn_weights,xv)
        # merge the heads back in both branches
        attn_output=attn_output.transpose(1,2).reshape(bs,seqlen,self.head_num*self.head_dim)
        attn_output=self.res_dropout(self.o_proj(attn_output))
        return attn_output,past_key_value
# Example usage
embed_dim = 512  # embedding dimension
head_num = 8  # number of query heads
kv_head_num = 4  # number of key/value heads
dropout = 0.1  # dropout rate
max_seqlen = 1024  # maximum sequence length

gqa=GroupQueryAttention(embed_dim, head_num, kv_head_num, dropout, max_seqlen=max_seqlen)
output, past_key_value = gqa(embedded_input,use_cache=True)
print(output.shape)  # expected shape: (2, 10, 512)
print(past_key_value[0].shape)  # cached keys after repeat_kv, expected shape: (2, 10, 8, 64)



torch.Size([32])
torch.Size([10])
torch.Size([10, 32])
torch.Size([2, 10, 512])
torch.Size([2, 10, 8, 64])
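
To see what a causal mask has to do, here is a small plain-Python sketch: an additive mask that is 0 where a query may look (its own position and earlier ones) and a large negative number on future positions, followed by a softmax. The uniform scores and the helper names are made up for illustration.

```python
import math

def causal_mask(seqlen, neg=-1e9):
    """Additive mask: 0 where key j <= query i, a large negative value where j > i."""
    return [[0.0 if j <= i else neg for j in range(seqlen)] for i in range(seqlen)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

scores = [[0.0, 0.0, 0.0] for _ in range(3)]   # uniform raw scores, made up
mask = causal_mask(3)
weights = [softmax([s + m for s, m in zip(srow, mrow)])
           for srow, mrow in zip(scores, mask)]
for row in weights:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

Each row sums to 1 and puts zero weight on later tokens, which is the behaviour the mask buffer in the attention layer needs to reproduce.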


#### 2.4 FFN
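The choice of FFN is the main difference between MoE and dense models: an MoE layer holds several FFNs.
This version is plain torch and is not yet wrapped with the transformers library, so we use the dense model for the demonstration; the MoE version will be uploaded later.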

In [31]:
class FeedForward(nn.Module):
    def __init__(self,embed_dim,ffn_dim,dropout=0.1):
        super(FeedForward,self).__init__()
        ## basic attributes
        self.embed_dim=embed_dim
        self.ffn_dim=ffn_dim
        self.dropout=dropout
        ## layers
        self.gate=nn.Linear(self.embed_dim,self.ffn_dim)
        self.up_proj=nn.Linear(self.embed_dim,self.ffn_dim)
        self.down_proj=nn.Linear(self.ffn_dim,self.embed_dim)
        self.res_dropout=nn.Dropout(self.dropout)
    def forward(self,X):
        # gated FFN as in LLaMA: down_proj(silu(gate(X)) * up_proj(X))
        res= F.silu(self.gate(X)) * self.up_proj(X)
        res= self.down_proj(res)
        res= self.res_dropout(res)
        return res
# Example usage
embed_dim = 10  # embedding dimension
ffn_dim = 2048  # hidden dimension of the feed-forward network
dropout = 0.1  # dropout rate
ffn_layer = FeedForward(embed_dim, ffn_dim, dropout)
input_ids=torch.rand(2, 1, embed_dim)  # 2 sentences, 1 token each, embedding dimension 10
ffn_output = ffn_layer(input_ids)
print(input_ids)  # the input tensor, shape (2, 1, 10)
print(ffn_output)  # the FFN output
print(ffn_output.shape)  # expected shape: (2, 1, 10)

tensor([[[0.4633, 0.2942, 0.3766, 0.3820, 0.4669, 0.8147, 0.9017, 0.5830,
          0.8753, 0.4055]],

        [[0.4591, 0.6238, 0.5254, 0.7336, 0.1214, 0.4545, 0.3634, 0.3871,
          0.8580, 0.5933]]])
tensor([[[-0.5581,  0.0881,  0.0698,  0.3024,  0.3239, -0.0000,  0.1254,
           0.3087,  0.1068, -0.0631]],

        [[-0.3845,  0.2091, -0.0874,  0.5110,  0.2865, -0.0427,  0.1898,
           0.0488,  0.0000, -0.0000]]], grad_fn=<MulBackward0>)
torch.Size([2, 1, 10])
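
The gated feed-forward used in LLaMA-style models is usually written as `down_proj(silu(gate(x)) * up_proj(x))`. The scalar sketch below, in plain Python with made-up weights, shows the role of the gate: when the gate branch outputs 0 the whole unit is switched off, whatever the up branch computes. The function names are illustrative.

```python
import math

def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def gated_ffn_scalar(x, w_gate, w_up, w_down):
    """One-dimensional gated FFN: w_down * (silu(w_gate * x) * (w_up * x))."""
    return w_down * (silu(w_gate * x) * (w_up * x))

closed = gated_ffn_scalar(x=1.0, w_gate=0.0, w_up=3.0, w_down=2.0)   # gate output is silu(0) = 0
opened = gated_ffn_scalar(x=1.0, w_gate=2.0, w_up=3.0, w_down=2.0)   # gate output is silu(2)
print(closed, round(opened, 4))
```

In the real layer the same product is taken element-wise over `ffn_dim` hidden units before `down_proj` maps back to `embed_dim`.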


### 3.Building the Block
![llm](../images/LLM-structure.png)

In [40]:
class MiniMind_Block_Dense(nn.Module):
    def __init__(self,block_id,embed_dim,head_num,kv_head_num,ffn_dim,attn_dropout=0.1,ffn_dropout=0.1,Flash=False,max_seqlen=1024):
        super(MiniMind_Block_Dense,self).__init__()
        ## basic attributes
        self.block_id=block_id
        self.embed_dim=embed_dim
        self.head_num=head_num
        self.kv_head_num=kv_head_num
        self.head_dim=embed_dim//head_num
        self.rep_num=self.head_num//self.kv_head_num
        assert self.rep_num * self.kv_head_num == self.head_num, "head_num must be divisible by kv_head_num"
        assert embed_dim == head_num * self.head_dim,"embed_dim must be divisible by head_num"

        self.ffn_dim=ffn_dim
        self.attn_dropout=attn_dropout
        self.ffn_dropout=ffn_dropout
        self.Flash=Flash
        self.max_seqlen=max_seqlen

        self.attn=GroupQueryAttention(embed_dim,head_num,kv_head_num,attn_dropout,Flash=Flash,max_seqlen=max_seqlen)
        self.ffn=FeedForward(embed_dim,ffn_dim,ffn_dropout)
        self.norm1=RMSNorm(embed_dim)
        self.norm2=RMSNorm(embed_dim)

        ## non-trainable buffer: rotation factors for up to max_seqlen positions
        pos_cis=precompute_pos_cis(self.head_dim,self.max_seqlen)
        self.register_buffer("pos_cis",pos_cis)

    def forward(self,X,
                pos_cis=None,
                use_cache=False,
                past_key_value=None,
                ):
        """
        X: (bs,seqlen,embed_dim)
        pos_cis: (seqlen, head_dim//2)
        use_cache: whether to return the key/value cache
        past_key_value: cached keys and values from earlier steps
        """
        # 1. Attention (pre-norm): normalize the input, then add the result back onto the un-normalized X
        attn_output,past_key_value=self.attn(self.norm1(X),pos_cis=pos_cis,use_cache=use_cache,past_key_value=past_key_value)
        X=X+attn_output

        # 2. FFN (pre-norm), again with the residual taken from the un-normalized X
        ffn_output=self.ffn(self.norm2(X))
        X=X+ffn_output

        return X,past_key_value

# Example usage
embed_dim = 512  # embedding dimension
head_num = 8  # number of query heads
kv_head_num = 4  # number of key/value heads
ffn_dim = 2048  # hidden dimension of the feed-forward network
attn_dropout = 0.1  # attention dropout rate
ffn_dropout = 0.1  # feed-forward dropout rate
max_seqlen = 1024  # maximum sequence length
block = MiniMind_Block_Dense(0,embed_dim, head_num, kv_head_num, ffn_dim, attn_dropout, ffn_dropout, max_seqlen=max_seqlen)
input_ids = torch.rand(2, 10, embed_dim)  # 2 sentences, 10 tokens each, embedding dimension 512
output, past_key_value = block(input_ids, use_cache=True)
print(output.shape)  # expected shape: (2, 10, 512)

torch.Size([32])
torch.Size([1024])
torch.Size([1024, 32])
torch.Size([32])
torch.Size([10])
torch.Size([10, 32])
torch.Size([2, 10, 512])


### 4.MiniMind_Dense
Assemble everything into one pipeline.

In [48]:
class MinimindLM_Dense(nn.Module):
    def __init__(self,block_num,vocab_size,embed_dim,head_num,kv_head_num,ffn_dim,attn_dropout=0.1,ffn_dropout=0.1,Flash=False,max_seqlen=1024):
        super(MinimindLM_Dense,self).__init__()
        self.block_num=block_num
        self.vocab_size=vocab_size
        self.embed_dim=embed_dim
        self.head_num=head_num
        self.kv_head_num=kv_head_num
        self.ffn_dim=ffn_dim
        self.attn_dropout=attn_dropout
        self.ffn_dropout=ffn_dropout
        self.Flash=Flash
        self.max_seqlen=max_seqlen

        self.embedding=Embedding(self.vocab_size,embed_dim)
        self.blocks=nn.ModuleList([MiniMind_Block_Dense(i,embed_dim,head_num,kv_head_num,ffn_dim,attn_dropout,ffn_dropout,Flash,max_seqlen) for i in range(block_num)])
        self.norm=RMSNorm(embed_dim)
        self.lm_output=nn.Linear(embed_dim,vocab_size)
    def forward(self,input_ids):
        bs,seqlen = input_ids.shape
        X=self.embedding(input_ids)  # (bs,seqlen,embed_dim)
        for block in self.blocks:
            # a KV cache belongs to one block, so nothing is passed from block to block here
            X,_=block(X,use_cache=False)
        X=self.norm(X)
        lm_logits=self.lm_output(X)  # (bs,seqlen,vocab_size)
        lm_probs=F.softmax(lm_logits,dim=-1)  # softmax turns the logits into a probability distribution
        return lm_probs
# Example usage
block_num = 12  # base block count; the model below is built with block_num*3 = 36 blocks
vocab_size = 128000  # vocabulary size
embed_dim = 1024  # embedding dimension
head_num = 8  # number of query heads
kv_head_num = 4  # number of key/value heads
ffn_dim = 2048  # hidden dimension of the feed-forward network
attn_dropout = 0.1  # attention dropout rate
ffn_dropout = 0.1  # feed-forward dropout rate
max_seqlen = 1024  # maximum sequence length
model = MinimindLM_Dense(block_num*3, vocab_size, embed_dim, head_num, kv_head_num, ffn_dim, attn_dropout, ffn_dropout, max_seqlen=max_seqlen)
input_ids = torch.randint(0, vocab_size, (2, 10))  # 2 sentences, 10 tokens each
lm_probs = model(input_ids)
print(lm_probs.shape)  # expected shape: (2, 10, 128000)
print(lm_probs)  # the predicted next-token probabilities

torch.Size([64])
torch.Size([1024])
torch.Size([1024, 64])
...

# Our torch-native dense model is now complete
Let's count the model's parameters.

In [51]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params/1e9:,} B")

Total parameters: 0.602380288 B
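
The total can be reproduced by hand. The sketch below assumes the configuration used above (36 blocks, `vocab_size=128000`, `embed_dim=1024`, `ffn_dim=2048`, 8 query heads, 4 KV heads) and that every `nn.Linear` keeps its default bias; registered buffers such as the mask and `pos_cis` are not parameters and are not counted. The helper `linear_params` is illustrative.

```python
def linear_params(in_dim, out_dim, bias=True):
    """Parameter count of a fully connected layer."""
    return in_dim * out_dim + (out_dim if bias else 0)

vocab_size, embed_dim, ffn_dim = 128000, 1024, 2048
head_num, kv_head_num, blocks = 8, 4, 36
head_dim = embed_dim // head_num

attn = (linear_params(embed_dim, head_num * head_dim)            # q_proj
        + 2 * linear_params(embed_dim, kv_head_num * head_dim)   # k_proj, v_proj
        + linear_params(head_num * head_dim, embed_dim))         # o_proj
ffn = 2 * linear_params(embed_dim, ffn_dim) + linear_params(ffn_dim, embed_dim)
norms = 2 * embed_dim                                            # two RMSNorm weights
per_block = attn + ffn + norms

total = (vocab_size * embed_dim                                  # embedding table
         + blocks * per_block
         + embed_dim                                             # final RMSNorm
         + linear_params(embed_dim, vocab_size))                 # lm_output
print(f"{per_block:,}")   # 9,447,424
print(f"{total:,}")       # 602,380,288
```

With a 128000-token vocabulary the embedding table and the output projection together account for about 262M of the 602M parameters, so the vocabulary size matters as much as the depth at this scale.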
