# 准备数据集

我们将《三国演义》的原文作为数据集, 来训练一个字符级别的语言模型. 也就是将原文中的汉字以及标点符号等等映射成整型(`int`).

In [333]:
# 导入需要的包
import pickle
import numpy as np

In [334]:
# 打开《三国演义》的文本文件`input.txt`
# 然后读取
with open('input.txt', 'r') as f:
    data = f.read()

# 打印《三国演义》中的字符数量
print(f"字符数据集的长度: {len(data):,}")

字符数据集的长度: 605,548


In [335]:
# 计算《三国演义》中有多少个不同的字符

# set去重 --> list转换成列表 --> 排序
chars = sorted(list(set(data)))
# 不同字符的数量
vocab_size = len(chars)

print("所有不同的字符:", ''.join(chars))
print(f"不同字符数量: {vocab_size:,}")

所有不同的字符: 
 <>[]—‘’“”…□　、。《》【】一丁七万丈三上下不与丐丑专且丕世丘丙业丛东丝丞丢两严丧个中丰临丸丹为主丽举乂乃久么义之乌乎乏乐乔乖乘乙九乞也习乡书买乱乳乾了予争事二于亏云互五井亘亚些亟亡亢交亥亦产亨亩享京亭亮亲亵亹人什仁仅仆仇今介仍从仓仔仕他仗付仙仞代令以仪们仰仲件价任仿伉伊伍伎伏伐休众优伙会伞伟传伤伦伪伯伴伷伸伺似但位低住佐佑体何佗余佛作佞你佣佥佩佯佳佻使侄侈例侍供依侠侥侧侪侮侯侵便促俄俊俎俗俘保俞俟信俦俨俭修俯俱俸俺俾倅倍倏倒倘候倚借倡倥倦值倾偃假偎偏偕做停健偬偶偷偿傅傍傕储催傲像僚僧僭僮僵僻儁儒儿兀允元兄充兆先光克免兔兖党兜兢入全八公六兮兰共关兴兵其具典兹养兼兽冀内冈册再冒冓冕冗写军农冠冢冤冥冬冯冰冲决况冶冷冻净凄准凉凋凌减凑凛凝几凡凤凭凯凰凳凶凹出击函凿刀刁刃分切刈刎刑划刖列刘则刚创初判利别刮到制刺刻刽剁剂削前剐剑剔剖剜剥剧剩剪副割剽剿劈力劝办功加务劣动助努劫劬劭励劲劳劾势勃勇勉勋勑勒勖勘募勤勺勾勿匄包匆匍匐化北匙匝匠匡匣匪匮匹区医匿十千升午半华协卑卒卓单卖南博卜卞占卢卣卤卦卧卫卯印危即却卵卷卸卿厄厅历厉压厌厔厕厘厚原厢厥厦厨厮去县参又叉及友双反发叔取受变叙叛叟叠口古句另叨叩只叫召叮可台叱史右叵叶号司叹吁吃各合吉吊同名后吏吐向吓吕君吝吞吟吠否含听启吴吸吹吻吼吾呀呆呈告呐呕员呜呦周味呵呻呼命咆和咎咏咐咒咛咥咨咫咬咸咽哀品哂哄哉响哑哙哥哨哩哭哮哲哺哽唆唇唐唤唬唯唱唾唿商啕啖啜啸啼喂喃善喈喉喊喏喘喜喝喟喧喨喷喻嗓嗔嗜嗟嗣嗤嘉嘏嘤嘱嘴嘶嘹噀噎噤器噪噫噬嚎嚷嚼囊囚四回因团囧园困围囷固国图圃圆圈土圣在圭地场坂均坊坌坎坏坐坑块坚坛坞坟坠坡坤坦垂垒垓垕垛垠垢垣垦垫埃埋城域基堂堆堑堕堤堪堰堵塌塑塔塘塞填墀境墉墓墙增墟墨墩墵壁壎壑壕壤士壬壮声壳壶处备复夏夔夕外夙多夜够夤夥大天太夫夭央失头夷夸夹夺奁奂奄奇奈奉奋奎奏契奔奕奖套奚奠奢奥女奴奸好如妃妄妆妇妒妓妖妙妥妨妫妹妻妾姊始姐姑姓委姚姜姬姻姿威娄娇娘娥娩娱娴娶娼婆婉婚婢婴婿媒媚嫁嫂嫉嫌嫔嫡嫩嬉嬖嬴子孑孔孕字存孙孚孝孟季孤孥学孩孰孱孺孽宁宄宅宇守安宋完宏宓宕宗官宙定宛宜宝实宠审客宣室宥宦宪宫宰害宴宵家容宽宾宿寂寄寅密寇富寐寒寓寔寝寞察寡寤寨寮寰寸对寺寻导寿封射将尉尊小少尔尖尘尚尝尤尧尪就尸尹尺尼尽尾局层居屈屋屏屑展属屠屡履屦屯山岁岂岌岐岑岖岗岘岛岩岭岱岳岷岸峙峡峨峪峭峰峻崇崎崔崖崤崦崩嵋嵌嵩嵯嶲嶷巅巍川州巡

In [336]:
# 创建从字符到整数的映射

# 从字符到整数的映射字典
stoi = { ch:i for i,ch in enumerate(chars) }

# 从整数到字符的映射字典
itos = { i:ch for i,ch in enumerate(chars) }

# 例如我们可以看一下`鼻`这个字对应的整数
print(stoi['鼻'])

3934


In [338]:
# 给定一个字符串`s`, 输入字符串中每个字对应的整数组成的列表
def encode(s):
    return [stoi[c] for c in s]

# 给定一个整数列表, 返回列表中每个整数对应的字符所组成的字符串
def decode(l):
    return ''.join([itos[i] for i in l])

# 测试一下
print(encode('滚滚长江东逝水'))
print(decode([2044, 2044, 3600, 1881, 40, 3429, 1871]))

[2044, 2044, 3600, 1881, 40, 3429, 1871]
滚滚长江东逝水


In [314]:
# 切分数据集
# 将《三国演义》前90%的文字作为训练数据集
n = len(data)
train_data = data[:int(n*0.9)]
# 将《三国演义》后10%的文字作为验证数据集
val_data = data[int(n*0.9):]

In [315]:
# 分别将训练数据集中的字符和验证数据集中的字符编码成整数
train_ids = encode(train_data)
val_ids = encode(val_data)

print(f"训练数据集中有 {len(train_ids):,} 个字符(token)")
print(f"验证数据集中有 {len(val_ids):,} 个字符(token)")

训练数据集中有 544,993 个字符(token)
验证数据集中有 60,555 个字符(token)


In [316]:
# 将训练数据集和验证数据集分别保存成二进制文件
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile('train.bin')
val_ids.tofile('val.bin')

In [317]:
# 将元数据保存成pickle格式的文件, 供我们后面在encode或者decode时使用
meta = {
    'vocab_size': vocab_size,
    'itos': itos,
    'stoi': stoi,
}
with open('meta.pkl', 'wb') as f:
    pickle.dump(meta, f)

我们数据准备的工作就完成了.

# 编写GPT模型

接下来我们开始编写模型代码

In [318]:
# 首先导入需要的一些包
import math
import inspect
from dataclasses import dataclass

import torch
import torch.nn as nn
from torch.nn import functional as F

## GeLU激活函数

公式如下:

In [319]:
# 定义GELU激活函数, 具体论文参见:
# https://arxiv.org/abs/1606.08415
def new_gelu(x):
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

## 层归一化模块

In [320]:
# 定义层归一化模块
class LayerNorm(nn.Module):
    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
    
    def forward(self, input):
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)

## 因果自注意力机制模块

In [321]:
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        # 确保词嵌入向量的维度是head数量的整数倍
        assert config.n_embd % config.n_head == 0
        # 下面的线性变换计算的是:
        # 在一批中, 所有头的key, query, value的投影
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # 正则化(regularization)
        self.resid_dropout = nn.Dropout(config.dropout)
        # 下面的线性变换的作用是将投影输出
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        # head的数量
        self.n_head = config.n_head
        # 词嵌入向量的维度
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        
    def forward(self, x):
        """
        定义因果自注意力模块在接收到张量x时, 输出什么样的张量
        """
        B, T, C = x.size() # 批的大小(batch size), 序列长度(sequence length), 词嵌入向量维度(n_embd)
        
        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k ,v  = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        
        y = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout, is_causal=True)
        
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side
        
        # output projection
        y = self.resid_dropout(self.c_proj(y))

        return y

## 多层感知机模块

In [322]:
class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        x = new_gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x

## Block模块

![](assets/Block模块示意图.svg)

In [323]:
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

## GPT模型的一些参数配置

In [324]:
@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304 # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster

## GPT模型的实现

In [325]:
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.config = config
        
        # Transformer模块
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = LayerNorm(config.n_embd, bias=config.bias),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.transformer.wte.weight = self.lm_head.weight
        
        # 初始化所有权重
        self.apply(self._init_weights)
        
        # apply special scaled init to the residual projections, per GPT-2 paper
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))
        
        # 打印模型的参数数量
        print("参数数量: %.2fM" % (self.get_num_params() / 1e6,))
        
    def get_num_params(self, non_embedding=True):
        """
        Return the number of parameters in the model.
        For non-embedding count (default), the position embeddings get subtracted.
        The token embeddings would too, except due to the parameter sharing these
        params are actually used as weights in the final layer, so we include them.
        """
        n_params = sum(p.numel() for p in self.parameters())
        if non_embedding:
            n_params -= self.transformer.wpe.weight.numel()
        return n_params
    
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
            elif isinstance(module, nn.Embedding):
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.2)
                
    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.config.block_size, f"无法前馈(向前发送)序列长度: {t}, 因为block size只有{self.config.block_size}"
        pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0) # shape (1, t)
        
        tok_emb = self.transformer.wte(idx) # token嵌入向量的形状 (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos) # 位置嵌入向量的形状 (1, t, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        
        if targets is not None:
            # if we are given some desired targets also calculate the loss
            logits = self.lm_head(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        else:
            # inference-time mini-optimization: only forward the lm_head on the very last position
            logits = self.lm_head(x[:, [-1], :]) # note: using list [-1] to preserve the time dim
            loss = None
        
        return logits, loss
    
    def crop_block_size(self, block_size):
        # model surgery to decrease the block size if necessary
        # e.g. we may load the GPT2 pretrained model checkpoint (block size 1024)
        # but want to use a smaller block size for some smaller, simpler model
        assert block_size <= self.config.block_size
        self.config.block_size = block_size
        self.transformer.wpe.weight = nn.Parameter(self.transformer.wpe.weight[:block_size])
        for block in self.transformer.h:
            block.attn.bias = block.attn.bias[:,:,:block_size,:block_size]
    
    def configure_optimizers(self, weight_decay, learning_rate, betas, device_type):
        """
        This long function is unfortunately doing something very simple and is being very defensive:
        We are separating out all parameters of the model into two buckets: those that will experience
        weight decay for regularization and those that won't (biases, and layernorm/embedding weights).
        We are then returning the PyTorch optimizer object.
        """

        # separate out all parameters to those that will and won't experience regularizing weight decay
        decay = set()
        no_decay = set()
        whitelist_weight_modules = (torch.nn.Linear, )
        blacklist_weight_modules = (torch.nn.LayerNorm, LayerNorm, torch.nn.Embedding)
        for mn, m in self.named_modules():
            for pn, p in m.named_parameters():
                fpn = '%s.%s' % (mn, pn) if mn else pn # full param name
                # random note: because named_modules and named_parameters are recursive
                # we will see the same tensors p many many times. but doing it this way
                # allows us to know which parent module any tensor p belongs to...
                if pn.endswith('bias'):
                    # all biases will not be decayed
                    no_decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):
                    # weights of whitelist modules will be weight decayed
                    decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):
                    # weights of blacklist modules will NOT be weight decayed
                    no_decay.add(fpn)

        # subtle: 'transformer.wte.weight' and 'lm_head.weight' are tied, so they
        # will appear in the no_decay and decay sets respectively after the above.
        # In addition, because named_parameters() doesn't return duplicates, it
        # will only return the first occurence, key'd by 'transformer.wte.weight', below.
        # so let's manually remove 'lm_head.weight' from decay set. This will include
        # this tensor into optimization via transformer.wte.weight only, and not decayed.
        decay.remove('lm_head.weight')

        # validate that we considered every parameter
        param_dict = {pn: p for pn, p in self.named_parameters()}
        inter_params = decay & no_decay
        union_params = decay | no_decay
        assert len(inter_params) == 0, "parameters %s made it into both decay/no_decay sets!" % (str(inter_params), )
        assert len(param_dict.keys() - union_params) == 0, "parameters %s were not separated into either decay/no_decay set!" \
                                                    % (str(param_dict.keys() - union_params), )

        # create the pytorch optimizer object
        optim_groups = [
            {"params": [param_dict[pn] for pn in sorted(list(decay))], "weight_decay": weight_decay},
            {"params": [param_dict[pn] for pn in sorted(list(no_decay))], "weight_decay": 0.0},
        ]
        # new PyTorch nightly has a new 'fused' option for AdamW that is much faster
        use_fused = (device_type == 'cuda') and ('fused' in inspect.signature(torch.optim.AdamW).parameters)
        print(f"using fused AdamW: {use_fused}")
        extra_args = dict(fused=True) if use_fused else dict()
        optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args)

        return optimizer
    
    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx

# 训练模型

由于配置问题, 我们使用CPU来训练GPT模型.

In [326]:
from contextlib import nullcontext
import pickle
import numpy as np
import torch

eval_interval = 250
log_interval = 1
eval_iters = 20
eval_only = False # if True, script exits right after the first eval
always_save_checkpoint = False # if True, always save a checkpoint after each eval
# data
gradient_accumulation_steps = 5 # used to simulate larger batch sizes
batch_size = 12 # if gradient_accumulation_steps > 1, this is the micro-batch size
block_size = 64
# model
n_layer = 4
n_head = 4
n_embd = 128
dropout = 0.0 # for pretraining 0 is good, for finetuning try 0.1+
bias = False # do we use bias inside LayerNorm and Linear layers?
# adamw optimizer
learning_rate = 1e-3 # max learning rate
max_iters = 2000 # total number of training iterations
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.99
grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True # whether to decay the learning rate
warmup_iters = 100 # how many steps to warm up for
lr_decay_iters = 2000 # should be ~= max_iters per Chinchilla
min_lr = 1e-4 # minimum learning rate, should be ~= learning_rate/10 per Chinchilla
# system
device = 'cpu' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks
dtype = 'bfloat16' # 'float32', 'bfloat16', or 'float16', the latter will auto implement a GradScaler
# -----------------------------------------------------------------------------
config_keys = [k for k,v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
config = {k: globals()[k] for k in config_keys} # will be useful for logging
# -----------------------------------------------------------------------------

gradient_accumulation_steps *= 8 # simulate 8 gpus

torch.manual_seed(1337)
torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
device_type = 'cpu' # for later use in torch.autocast
ctx = nullcontext()

train_data = np.memmap('train.bin', dtype=np.uint16, mode='r')
val_data = np.memmap('val.bin', dtype=np.uint16, mode='r')

In [327]:
# 定义切分数据集为输入和标签的函数
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
    
    x, y = x.to(device), y.to(device)
    
    return x, y

In [328]:
iter_num = 0 # 迭代次数
best_val_loss = 1e9 # 最佳损失

with open('meta.pkl', 'rb') as f:
    meta = pickle.load(f)

meta_vocab_size = meta['vocab_size']
print(f"found vocab_size = {meta_vocab_size} (inside 'meta.pkl')")

found vocab_size = 3951 (inside 'meta.pkl')


In [329]:
# model init
model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,
                  bias=bias, vocab_size=meta_vocab_size, dropout=dropout) # start with model_args from command line

# 从头开始训练模型
print("从头开始训练模型")
gptconf = GPTConfig(**model_args)
model = GPT(gptconf)

# crop down the model block size if desired, using model surgery
if block_size < model.config.block_size:
    model.crop_block_size(block_size)
    model_args['block_size'] = block_size # so that the checkpoint will have the right value

model.to(device)
# initialize a GradScaler. If enabled=False scaler is a no-op
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))
# optimizer
optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2), device)

X, Y = get_batch('train') # fetch the very first batch
print(decode(X[1].tolist()))
print(decode(Y[1].tolist()))

从头开始训练模型
参数数量: 1.29M
using fused AdamW: False
忧臣辱。某久事袁氏，岂可背之！”操知其不可留，乃遣回。评回见谭，言操不准投降。谭叱曰：“汝弟现事曹操，汝怀二心耶？”评闻言，气满
臣辱。某久事袁氏，岂可背之！”操知其不可留，乃遣回。评回见谭，言操不准投降。谭叱曰：“汝弟现事曹操，汝怀二心耶？”评闻言，气满填


In [330]:
# 估算损失
# helps estimate an arbitrarily accurate loss over either split using many batches
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            with ctx:
                logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [331]:
# learning rate decay scheduler (cosine with warmup)
def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) if it > lr_decay_iters, return min learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1
    return min_lr + coeff * (learning_rate - min_lr)

In [332]:
import time

t0 = time.time()
local_iter_num = 0 # number of iterations in the lifetime of this process

# 训练代码
while True:

    # determine and set the learning rate for this iteration
    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # evaluate the loss on train/val sets and write checkpoints
    if iter_num % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        if losses['val'] < best_val_loss or always_save_checkpoint:
            best_val_loss = losses['val']
            if iter_num > 0:
                checkpoint = {
                    'model': model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'model_args': model_args,
                    'iter_num': iter_num,
                    'best_val_loss': best_val_loss,
                    'config': config,
                }
                print(f"将模型的检查点文件保存到 'ckpt.pt'")
                torch.save(checkpoint, 'ckpt.pt')

    # forward backward update, with optional gradient accumulation to simulate larger batch size
    # and using the GradScaler if data type is float16
    for micro_step in range(gradient_accumulation_steps):
        with ctx:
            logits, loss = model(X, Y)
        # immediately async prefetch next batch while model is doing the forward pass on the GPU
        X, Y = get_batch('train')
        # backward pass, with gradient scaling if training in fp16
        scaler.scale(loss).backward()
    # clip the gradient
    if grad_clip != 0.0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    # step the optimizer and scaler if training in fp16
    scaler.step(optimizer)
    scaler.update()
    # flush the gradients as soon as we can, no need for this memory anymore
    optimizer.zero_grad(set_to_none=True)

    # timing and logging
    t1 = time.time()
    dt = t1 - t0
    t0 = t1
    if iter_num % log_interval == 0:
        lossf = loss.item() # loss as float. note: this is a CPU-GPU sync point
        print(f"iter {iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms")
    iter_num += 1
    local_iter_num += 1

    # termination conditions
    if iter_num > max_iters:
        break

step 0: train loss 8.3115, val loss 8.3091
iter 0: loss 8.3285, time 2750.88ms
iter 1: loss 8.3125, time 1935.22ms
iter 2: loss 8.3172, time 1945.68ms
iter 3: loss 8.3049, time 1922.65ms
iter 4: loss 8.3172, time 1912.10ms
iter 5: loss 8.2958, time 2022.62ms
iter 6: loss 8.3068, time 2186.26ms
iter 7: loss 8.3106, time 2258.78ms
iter 8: loss 8.3074, time 2028.46ms
iter 9: loss 8.3037, time 2096.54ms
iter 10: loss 8.2843, time 2063.59ms
iter 11: loss 8.3023, time 2249.82ms
iter 12: loss 8.2935, time 2345.97ms
iter 13: loss 8.2923, time 2577.94ms
iter 14: loss 8.2826, time 2464.92ms
iter 15: loss 8.2893, time 2327.37ms
iter 16: loss 8.2691, time 2392.97ms
iter 17: loss 8.2718, time 2459.26ms
iter 18: loss 8.2606, time 2268.73ms
iter 19: loss 8.2513, time 2492.58ms
iter 20: loss 8.2384, time 2398.60ms
iter 21: loss 8.2184, time 2443.73ms
iter 22: loss 8.2092, time 2283.98ms
iter 23: loss 8.2103, time 2434.89ms
iter 24: loss 8.1674, time 2371.96ms
iter 25: loss 8.1190, time 2269.50ms
iter 

KeyboardInterrupt: 

# 生成数据

In [280]:
import pickle
from contextlib import nullcontext

start = "\n"
num_samples = 10
max_new_tokens = 500
top_k = 200
temperature = 0.8
seed = 1337
device = 'cpu'

torch.manual_seed(seed)
device_type = 'cpu'
ctx = nullcontext()

checkpoint = torch.load('ckpt.pt', map_location=device)
gptconf = GPTConfig(**checkpoint['model_args'])
model = GPT(gptconf)
state_dict = checkpoint['model']
unwanted_prefix = '_orig_mod.'
for k,v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict)

model.eval()
model.to(device)

with open('meta.pkl', 'rb') as f:
    meta = pickle.load(f)

stoi, itos = meta['stoi'], meta['itos']
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

start_ids = encode(start)
x = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])

# run generation
with torch.no_grad():
    with ctx:
        for k in range(num_samples):
            y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
            print(decode(y[0].tolist()))
            print('---------------')

参数数量: 1.29M

昔令马于大死曰：“非孙操　　　却有吴，即德曰：“既言。”操命曰：“张操曰：“然大。”遂在：“孔明不可不曰：“吾以自遣无等要见？”庞德不令出？：“汝何将请王不汉曰：“吾以某近曰：“汝说公曰：“必不乃能退，并拜先闻何将，吾有是也，今亦毕，故日曰：“吾者视者。”遂多可，如言。”睿曰：“将，必说诸说诸从，以言，便在功，张操告意：“吾之，反之。”自不将许夫一相有延取今尚必退不知。”孔明汝说尚云以延如兵，愿故人，留相欲受军接，待将。”张明曰：“此！”次是兵。”玄德长子已进，便欲能无是备曰：“吾弟有可将乃用大不人曰：“又弟，必若有吾在。”遂欲不主将后先日，反？”众已得，刘孔　明曰：“，令。”玄德主之皆，方见前多。”且吾公无可多，若肃曰：“叱曰：“闻玄德为何如及不十道中不来。”孔操曰：“一曾人，恐、公马也。”孔布曰：“玄德闻父士，早能被魏一可先人间，主后其可兵，吾得来？”玄德乃在吾日意。背云。
　　　　　　　　　　　
　明汝知是相报延？”且何欲二是此在太亮相引是汝今更曰：“汝说王马，吾将一将兵也。吾不有等来公用之便听取主何人，今瑜安看。”公与我到。今见来。”绍引我大锋兵，然，有江人为夫于葛，乃瑜、刘瑜
---------------

今欲士，可唤主主视将徐，出，望来有喜。后：“蜀从长多意，若可为此下玄德告，一一杀，大可当军，不思军，不大定。懿；。”谁，上见上，主引军，不闻关之军二有东书，即长公一发，不得来，与太之？后言。吾无夜，只引见军？”玄姜亮必言于，当且复今不取说：“此二，皆在城首，背瑜曰：“吾急，非知后。”且此领兵，何，可然曰：“吾诸军不可辞至，何在此之？”玄德、绍、吾曰：“袁孔　次行。”融曰：“此人一知，不可杀，皆在手，纵官曰：“汝只报说袁正在喜，以喝自为子州曰：“说权之。”吾报臣之战，先五才分曰：“说我，马曰：“赵仁书，必兄亲曰：“孔此兵，汝肃曰：“必弟，亦令，以吾日曰：“玄明曰：“吾见也。”玄德曰：“是吾吾欲欲令若何欲有千，何要又与张原军于之袁公曰：“吾如待出，乃何人，乃主超一敢长引人。今正使王一十之，我曹统曰：“欲有要大问、张玄德，不惊，回。”苞不得至武伏曰：“蜀也，我入而得之处之之，乃说其入无如说此马来，不当在后奔赶见后言，方杀与不知。”何引前，且一人。”遂曰：“公曰：“遂不再能人曰：“瑜！”辽、张操曰：“孔德曰：“四知兵报鲁、孟曰：“闻