# HW4P1:语言模型的【预测】和【生成】


### 1. 作业概述
这是课程的第一部分作业，主要是使用PyTorch进行训练。你将致力于训练语言模型，并对模型进行预测和生成任务的评估。

### 2. 注意事项
在完成作业的过程中，请仔细阅读代码，并关注待办事项（TODOs）。

### 3. Jupyter Notebook的结构
该笔记本的结构如下：

- **Imports and installs（导入和安装）**：指定正确的数据路径，主要是运行相关代码。
- **Datasets（数据集）**：完成TODO部分并运行。
- **Dataloader（数据加载器）**：完成TODO部分并运行。
- **Language model architecture（语言模型架构）**：根据写作要求，实施并定义你喜欢的模型架构。
- **Dataloader, model, loss, optimizer, and scheduler definition（数据加载器、模型、损失函数、优化器和调度器的定义）**：定义数据加载器、模型、损失函数、优化器和调度器。
- **Trainer class（训练类）**：与所有P2不同，我们为这次作业使用了Trainer类，需检查该类并完成训练函数。
- **Wandb**：添加正确的API密钥。
- **Experiments（实验）**：运行实验并记录最终的NLL（负对数似然）指标。
- **Evaluation（评估）**：访问OpenAI API以获取最终的困惑度（Perplexity）指标。
- **Submission（提交）**：为Autolab创建提交文件。


In [1]:
#确保所有后续的 Matplotlib 图表都能在 Notebook 中显示
%matplotlib inline 

import torch

import os

import time
import numpy as np
from matplotlib import pyplot as plt
from tqdm.notebook import tqdm
import torchsummaryX
import gc
import wandb
import yaml
# import openai

# Importing necessary modules from hw4
# Update the path depending on how you choose to load the handout
from tests_hw4 import get_prediction_nll, make_generation_text

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print("Device: ", DEVICE)

Device:  cuda


# 数据集

我们将在 WikiText-2 语言建模数据集上训练 RNN 语言模型，该数据集已为 HW4p1 进行了预处理，并包含在模板压缩包中：
- vocab.npy：包含词汇表中单词的 NumPy 文件
- vocab.csv：列出词汇表的人可读 CSV 文件（ym备注： npy末尾还有SOS和EOS， csv中没有）
- wiki.train.nyy：包含训练文本的NumPy文件

我们还在 fixture 目录中提供了测试文件，这些文件将构成你将要实现的预测和生成函数的测试用例。你无需担心这些文件，处理它们的测试脚本已经为你设置好了。

train 文件包含一个文章数组。每篇文章都是一个整数数组，与词汇表中的单词相对应。训练集中有 579 篇文章。
例如，训练集中的第一篇文章包含 3803 个整数。第一篇文章的前 6 个整数是 [1420 13859 3714 7036 1420 1417]。在词汇表中查找这些整数，会发现第一行是： = Valkyria Chronicles III = \<eol>


In [2]:
VOCAB       = np.load('dataset/vocab.npy')

# We have also included <sos> and <eos> in the vocabulary for you
# However in real life, you include it explicitly if not provided
SOS_TOKEN   = np.where(VOCAB == '<sos>')[0][0] #获取第一个满足条件的索引值
EOS_TOKEN   = np.where(VOCAB == '<eos>')[0][0]
NUM_WORDS   = len(VOCAB) - 2 # Actual number of words in vocabulary

print("Vocab length: ", len(VOCAB))
print(VOCAB)

Vocab length:  33280
['!' '"' '#' ... '～' '<sos>' '<eos>']


In [3]:
SOS_TOKEN

33278

In [4]:
EOS_TOKEN

33279

In [5]:
# Loding the training dataset. Refer to write up section 2 to understand the structure
dataset     = np.load('dataset/wiki.train.npy', allow_pickle=True)

# 首先打印数据集的一部分，查看数据结构
print(dataset[:2])  # 查看前2个样本

# The dataset does not have <sos> and <eos> because they are just regular articles.
# TODO: Add <sos> and <eos> to every article in the dataset.
# Before doing so, try printing the dataset to see if they are words or integers.

[array([ 1420, 13859,  3714, ...,   813,    79,  1417])
 array([ 1420, 13463,  3117, ...,  8635,    79,  1417])]


In [6]:
# 使用numpy的concatenate方法将所有数组连接起来，然后计算长度
total_elements = np.concatenate(dataset).size

print("总元素个数:", total_elements)

总元素个数: 2075677


In [7]:
# 为每篇文章首尾添加 <sos> 和 <eos> 标记
dataset = [np.concatenate(([SOS_TOKEN], article, [EOS_TOKEN])) for article in dataset]

# 检查是否成功加入
print(dataset[:2])  # 查看前2个样本

[array([33278,  1420, 13859, ...,    79,  1417, 33279], dtype=int64), array([33278,  1420, 13463, ...,    79,  1417, 33279], dtype=int64)]


In [8]:
# Loading the fixtures for validation and test - prediction
fixtures_pred       = np.load('fixtures/prediction.npz')        # validation
fixtures_pred_test  = np.load('fixtures/prediction_test.npz')   # test

print("Validation shapes    : ", fixtures_pred['inp'].shape, fixtures_pred['out'].shape)
print("Test shapes          : ", fixtures_pred_test['inp'].shape)

Validation shapes    :  (128, 21) (128,)
Test shapes          :  (128, 21)


In [9]:
# Loading the fixtures for validation and test - generation
fixtures_gen        = np.load('fixtures/generation.npy')        # validation
fixtures_gen_test   = np.load('fixtures/generation_test.npy')   # test

print("Validation Gen Shapes    :", fixtures_gen.shape)
print("Test Gen Shapes          :", fixtures_gen_test.shape)

Validation Gen Shapes    : (32, 21)
Test Gen Shapes          : (128, 31)


In [10]:
# Example Prediction Dev Input and Output
# Optional TODO: You can try printing a few samples from the validation set which has both inputs and outputs

# 打印验证集的前几个样本
num_samples_to_print = 5  # 你想要打印的样本数量

# 遍历并打印指定数量的样本
for i in range(num_samples_to_print):
    print(f"Sample {i+1}:")
    
    print("Input (inp):", fixtures_pred['inp'][i])

    
    # 将输入序号转换为对应的英文单词或字符
    input_words = [VOCAB[idx] for idx in fixtures_pred['inp'][i]]
    print("Input (inp):", ' '.join(input_words))  # 转换成字符串并打印
    
    # 打印输出标签
    print("Output (out):", fixtures_pred['out'][i])
    print("Output (out):", VOCAB[fixtures_pred['out'][i]])
    
    print("-" * 50)  # 分隔线，方便阅读


Sample 1:
Input (inp): [33278 26096 26972 25821 14658 29325 32935 21820 25639 16134 31353 29092
    79  6916    76 21415 14658 24911  1424 29456 29325]
Input (inp): <sos> output port of a section will generally not be the same . However , for a mid @-@ series section
Output (out): 72
Output (out): (
--------------------------------------------------
Sample 2:
Input (inp): [33278 14658 21076 21626 31353  6613  1419 10706 15340 25874 25949 31994
 21626  2299  3952    79  1419    76  1184 31543  1242]
Input (inp): <sos> a few from the Heavy <unk> Platoon and one or two from B Company . <unk> , 60 to 70
Output (out): 24820
Output (out): men
--------------------------------------------------
Sample 3:
Input (inp): [33278  1419 15219 27351 25131 21415 32352 25871 31353 28863    76 31353
 21201 31994 25821 32883 19278 21626 31353 25806  1424]
Input (inp): <sos> <unk> also produced monitors for use on the rivers , the first two of which differed from the ocean @-@
Output (out): 21959
Output (o

# Dataloader

In [11]:
class DataLoaderForLanguageModeling(torch.utils.data.DataLoader): # 继承自 torch.utils.data.DataLoader
    """
        TODO: 在这里定义数据加载器的逻辑
    """
    # TODO: 你可能还需要添加更多的参数。例如：序列长度（sequence length）
    def __init__(self, dataset, batch_size, sequence_length = 10, shuffle=True, drop_last=False):

        # 如果你还记得，这是定义数据加载器时需要提供的标准参数。
        # 现在你只是自定义了你自己的数据加载器。
        self.dataset = dataset  # 数据集，通常是由多个文章组成的列表
        self.batch_size = batch_size  # 指每个批次（batch）中包含的序列（样本）的数量。每个序列的长度由 sequence_length 决定。
        self.shuffle = shuffle  # 是否在每次迭代时打乱数据
        self.drop_last = drop_last  # 是否丢弃最后一个不完整的批次
        self.sequence_length = sequence_length  # 序列长度，决定每个样本的长度
        


    def __len__(self):
        # 当你打印 len(loader) 时，你得到的是什么输出？你得到的是批次数量。
        # 你的数据集有 (579, ) 篇文章，每篇文章包含指定数量的单词。
        # 你将数据集连接起来，然后根据序列长度对其进行分批处理。

        total_length = np.sum([self.dataset[i].shape[0] for i in range(len(self.dataset))])  # 计算数据集中所有单词的总长度
        batch_count = total_length // (self.batch_size * self.sequence_length)  
        return batch_count  # 计算数据集中可以生成多少个完整的批次。这个值用于控制训练的循环次数

    def __iter__(self):
        # TODO: 如果 shuffle 为 True，打乱数据
        if self.shuffle:
            # TODO
            np.random.shuffle(self.dataset)  # 打乱数据集中的文章顺序

        # TODO: 设置批次数量
        num_batches = self.__len__()  # 获取批次数量
        
        batches = []  # 初始化批次列表
        
        self.dataset_concatenated = np.concatenate(self.dataset)  # 将数据集中的所有文章连接成一个长序列
        for b in range(num_batches):
            batch = self.dataset_concatenated[b*self.batch_size*self.sequence_length:(b+1)*self.batch_size*self.sequence_length + 1]  # 切片获取每个批次的数据
            batches.append(batch)  # 将每个批次的数据添加到批次列表中
            
        if self.drop_last:
            batches = batches[:-1]  # 如果 drop_last 为 True，丢弃最后一个不完整的批次
            

        # TODO: 将连接的数据集划分为输入和目标。它们如何变化？

        # TODO: 将输入和目标重塑为批次（考虑最终的形状）

        # TODO: 遍历批次并根据序列长度生成输入和目标批次
        batch_idx = 0  # 初始化批次索引
        
        while batch_idx < batches.__len__(): #遍历 batches 列表中的每个批次
            # [:-1] 表示去除当前批次中的最后一个元素, 因为在语言模型中，输入序列不需要包含最后一个元素（这是因为目标序列是要预测下一个词的）。
            input = batches[batch_idx][:-1].reshape(self.batch_size, self.sequence_length)  # 重塑输入的形状
            target = batches[batch_idx][1:].reshape(self.batch_size, self.sequence_length)  # 重塑目标的形状
            batch_idx += 1  # 增加批次索引
            yield torch.tensor(input), torch.tensor(target)  # 使用 yield 返回输入和目标张量


### 对dataloder的备注

1. **输入与目标的对齐**：
   - 在语言模型中，输入序列用于预测目标序列。例如，输入是 `[The cat sat on the]`，目标是 `[cat sat on the mat]`。这种对齐方式使得模型能够学习到如何根据前面的单词预测下一个单词。
   - 因此，输入序列去掉最后一个单词（`[:-1]`），而目标序列去掉第一个单词（`[1:]`），这样输入和目标是对齐的，输入的每个位置都对应目标的下一个位置。

2. **确保批次大小和序列长度一致**：
   - 使用 `reshape(self.batch_size, self.sequence_length)` 确保数据的形状是 `[batch_size, sequence_length]`，使得每个批次中的数据形状一致，这对于模型的批次处理非常重要。
   - 这有助于训练过程中数据的批次化处理，使得模型能够在固定大小的批次上进行训练和更新权重。

3. **生成器的使用**：
   - `yield` 生成器的使用，使得数据可以逐批次地加载和处理，而不是一次性将所有数据加载到内存中。对于大数据集，这种方法非常有效，可以节省内存并提高处理效率。

4. **`sequence_length` 的作用**：

   - **定义输入序列的长度**：`sequence_length` 决定了每个输入序列中包含多少个单词或标记。例如，如果 `sequence_length` 设置为 `10`，那么每个输入序列将包含10个单词或标记。
   - **序列切分**：对于一篇较长的文章，数据加载器会将其按 `sequence_length` 切分成多个序列。例如，如果文章有30个单词，`sequence_length` 为10，那么这篇文章将被切分成3个序列，每个序列包含10个单词。





In [12]:
# Some sanity checks

dl = DataLoaderForLanguageModeling(
    dataset     = dataset,
    batch_size  = 32,
    shuffle     = True,
    drop_last   = True,
    # Input Extra parameters here if needed
)

inputs, targets = next(iter(dl))
print(inputs.shape, targets.shape)

for x, y in dl:
    print("x: ", [VOCAB[i] for i in x[0, :]]) # 打印第一个批次的第一个序列
    print("y: ", [VOCAB[i] for i in y[0, :]])
    break

torch.Size([32, 10]) torch.Size([32, 10])
x:  ['<sos>', '=', 'Leslie', 'Andrew', '=', '<eol>', 'Brigadier', 'Leslie', '<unk>', 'Andrew']
y:  ['=', 'Leslie', 'Andrew', '=', '<eol>', 'Brigadier', 'Leslie', '<unk>', 'Andrew', 'VC']


In [13]:
a = [i for i in range(10,100)]

for i in range(5):
    print(a[i*10 : (i+1)*10 + 1])

[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
[50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]


# 模型结构

In [14]:
import torch

class LanguageModel(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hid_dim):
        """
        初始化语言模型的各个组件，包括嵌入层、LSTM单元（LSTMCells）和线性投影层。
        
        参数：
        - vocab_size: 词汇表的大小，即模型能够处理的独特单词数量。
        - embed_dim: 嵌入层的维度，即每个单词在嵌入空间中的表示维度。
        - hid_dim: 隐藏层的维度，即LSTM单元中隐藏状态的大小。
        """
        super().__init__()

        # 定义嵌入层，将词汇表中的每个单词映射到一个固定大小的向量空间
        self.token_embedding = torch.nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)

        # 定义一系列LSTMCell单元，用于处理输入序列。这里使用了两个LSTMCell
        # 每个LSTMCell负责处理一个时间步的输入
        self.lstm_cells = torch.nn.Sequential(
            torch.nn.LSTMCell(input_size=embed_dim, hidden_size=hid_dim),  # 第一个LSTMCell，输入为嵌入向量，输出为隐藏状态
            torch.nn.LSTMCell(input_size=hid_dim, hidden_size=hid_dim)     # 第二个LSTMCell，输入为前一个LSTMCell的输出
        )
        self.hidden_dim = hid_dim  # 保存隐藏层的维度

        # 定义线性层，将LSTM的输出映射到词汇表的概率分布，用于预测下一个单词
        self.token_probability = torch.nn.Linear(hid_dim, vocab_size)

    def rnn_step(self, embedding, hidden_states_list):
        """
        对单个时间步的嵌入进行处理，更新LSTM单元的隐藏状态。
        
        参数：
        - embedding: 当前时间步的嵌入向量
        - hidden_states_list: LSTM单元的隐藏状态列表
        
        返回：
        - embedding: 当前时间步的隐藏状态（作为下一个LSTM单元的输入）
        - hidden_states_list: 更新后的隐藏状态列表
        """
        # 初始化当前时间步的隐藏状态和细胞状态
        hx = torch.zeros((embedding.shape[0], self.hidden_dim))  # 隐藏状态（batch_size, hidden_dim）
        cx = torch.zeros((embedding.shape[0], self.hidden_dim))  # 细胞状态（batch_size, hidden_dim）
        hidden_states_list[0] = (hx, cx)  # 存储初始的隐藏状态和细胞状态

        # 遍历所有的LSTMCell，依次更新隐藏状态
        for i in range(len(self.lstm_cells)):
            hx, cx = self.lstm_cells[i](embedding, hidden_states_list[-1])
            hidden_states_list[i] = (hx, cx)  # 更新当前LSTMCell的隐藏状态和细胞状态

        # 返回当前LSTMCell的输出（即更新后的隐藏状态）和所有LSTMCell的隐藏状态列表
        embedding = hx
        return embedding, hidden_states_list

    def predict(self, x):
        """
        对输入序列进行推理，返回最后一个时间步的概率分布。
        
        参数：
        - x: 输入的单词序列（可以是单个单词或一个序列）
        
        返回：
        - prob: 最后一个时间步的概率分布
        """
        if not torch.is_tensor(x):
            x = torch.tensor(x).long().to(DEVICE)

        with torch.inference_mode():  # 使用推理模式，不计算梯度
            prob, _ = self.forward(x)  # 调用前向传播，
            return prob[:, -1, :] #获取最后一个时间步的概率分布

    def generate(self, x, timesteps):
        """
        基于初始输入序列生成一个新序列，长度为指定的时间步数。
        
        参数：
        - x: 初始输入的单词序列
        - timesteps: 要生成的时间步数
        
        返回：
        - generated_sequence: 生成的单词序列
        """
        if not torch.is_tensor(x):
            x = torch.tensor(x).long().to(DEVICE)

        # 首先通过前向传播获取初始序列的概率分布和隐藏状态列表
        token_prob_dist, hidden_states_list = self.forward(x)
        next_token = token_prob_dist[:,-1,:].argmax(dim = -1)  # 获取最可能的下一个单词（通过取概率分布的最大值）

        next_token = next_token.reshape(token_prob_dist.shape[0], 1) # 维度[batch] → [batch, 1]
        
        generated_sequence = [next_token]  # 存储生成的单词序列
        timesteps -= 1 # 因为之前已经预测了一次下一个单词。
        with torch.inference_mode():
            for t in range(timesteps):  # 遍历每一个时间步，逐步生成单词
                # 使用上一个时间步生成的单词作为输入，更新隐藏状态
                token_prob_dist, hidden_states_list = self.forward(next_token, hidden_states_list)
                next_token = token_prob_dist.argmax(dim = -1)  # 获取当前时间步最可能的下一个单词
                generated_sequence.append(next_token)  # 将生成的单词添加到序列中

            # 将生成的单词序列堆叠为张量，形状为（batch_size, timesteps）
            generated_sequence = torch.stack(generated_sequence, dim=1)
        
        # 去除维度为1的
        generated_sequence = generated_sequence.squeeze(-1)
        
        return generated_sequence

    def forward(self, x, hidden_states_list=None):
        """
        模型的前向传播函数，处理输入序列，生成每个时间步的概率分布。
        
        参数：
        - x: 输入的单词序列，形状为（batch_size, seq_len）
        - hidden_states_list: LSTM单元的初始隐藏状态列表（如果有）
        
        返回：
        - token_prob_distribution: 每个时间步的概率分布，形状为（batch_size, seq_len, vocab_size）
        - hidden_states_list: 更新后的隐藏状态列表
        """
        batch_size, timesteps = x.shape  # 获取批次大小和序列长度

        # 存储所有时间步的概率分布
        token_prob_distribution = []
        # 初始化隐藏状态列表，如果没有提供初始隐藏状态
        hidden_states_list = [None] * len(self.lstm_cells) if hidden_states_list is None else hidden_states_list

        # 获取输入序列的嵌入表示，形状为（batch_size, seq_len, embed_dim）
        token_embeddings = self.token_embedding(x)

        # 遍历每个时间步，逐步处理输入序列
        for t in range(timesteps):
            token_embedding_t = token_embeddings[:, t, :]  # 获取当前时间步的嵌入向量

            # 通过LSTM单元处理当前时间步的嵌入，更新隐藏状态
            rnn_out, hidden_states_list = self.rnn_step(token_embedding_t, hidden_states_list)

            # 通过线性层将LSTM单元的输出映射到词汇表的概率分布
            token_prob_dist_t = self.token_probability(rnn_out)

            # 将当前时间步的概率分布添加到列表中
            token_prob_distribution.append(token_prob_dist_t)

        # 将所有时间步的概率分布堆叠为张量，形状为（batch_size, seq_len, vocab_size）
        token_prob_distribution = torch.stack(token_prob_distribution, dim=1)

        return token_prob_distribution, hidden_states_list


# trainer class

In [15]:
class Trainer:
    def __init__(self, model, loader, optimizer, criterion, scheduler, max_epochs=1, run_id='exp'):
        """
        初始化 Trainer 类，用于训练模型。

        参数：
        - model: 要训练的模型
        - loader: 数据加载器，用于获取训练数据
        - optimizer: 优化器，用于更新模型参数
        - criterion: 损失函数，用于计算训练过程中的损失
        - scheduler: 学习率调度器，用于调整学习率
        - max_epochs: 最大训练轮数，默认为1
        - run_id: 训练运行的标识符，用于保存和区分不同的实验结果
        """
        self.model = model
        self.loader = loader
        self.optimizer = optimizer
        self.criterion = criterion
        self.scheduler = scheduler

        # 用于保存训练过程中的损失和生成的结果
        self.train_losses = []
        self.val_losses = []
        self.predictions = []
        self.predictions_test = []
        self.generated_logits = []
        self.generated = []
        self.generated_logits_test = []
        self.generated_test = []
        self.epochs = 0
        self.max_epochs = max_epochs
        self.run_id = run_id

    def calculate_loss(self, out, target):
        """
        计算给定输出和目标之间的交叉熵损失。

        参数：
        - out: 模型的输出，形状为 (B, T, Vocab_size)，即批次大小、时间步和词汇表大小
        - target: 目标序列，形状为 (B, T)，即批次大小和时间步

        返回：
        - loss: 计算得到的交叉熵损失
        """
        # 将输出展平为二维张量，形状为 (B*T, Vocab_size)，保持词汇表大小不变
        out = out.reshape(-1, out.shape[-1])
        
        # 将目标展平为一维张量，形状为 (B*T)
        targets = target.reshape(-1)
        
        # 使用损失函数计算损失，损失函数会比较每个目标词汇与相应的概率分布
        loss = self.criterion(out, targets)

        return loss

    def train(self):
        """
        执行一个训练周期，遍历所有批次数据，并更新模型参数。
        """
        self.model.train()  # 将模型设置为训练模式
        self.model.to(DEVICE)  # 将模型移动到指定的设备（如 GPU）
        epoch_loss = 0  # 初始化一个变量用于累积整个周期的损失
        num_batches = 0  # 计数已处理的批次数量
        
        # 遍历数据加载器中的所有批次
        for batch_num, (inputs, targets) in enumerate(tqdm(self.loader)):

            # 清零优化器的梯度，准备计算新的梯度
            self.optimizer.zero_grad()
            
            inputs, targets = inputs.to(DEVICE), targets.to(DEVICE)
            
            out, _ = self.model(inputs)
            
            loss = self.calculate_loss(out, targets)
            
            loss.backward()
            
            # 优化器步骤，更新模型参数
            self.optimizer.step()
            
            # # 调度器步骤，根据损失调整学习率
            # self.scheduler.step(nll) #这里有问题，要放epoch层级里面，而不是batch层级里面

            num_batches += 1 

            # 将损失累加
            loss = loss.item()
            epoch_loss += loss
        
        # 计算整个周期的平均损失
        epoch_loss = epoch_loss / (batch_num + 1)
        self.epochs += 1  # 增加已完成的周期计数
        
        # 打印训练信息
        print('[TRAIN] \tEpoch [%d/%d] \tLoss: %.4f \tLr: %.6f'
              % (self.epochs, self.max_epochs, epoch_loss, self.optimizer.param_groups[0]['lr']))
        
        # 将本周期的损失添加到损失记录中
        self.train_losses.append(epoch_loss)

    def test(self):
        """
        执行模型的测试，评估模型在验证集上的表现。
        """
        self.model.eval()  # 将模型设置为评估模式
        predictions = self.model.predict(fixtures_pred['inp']).detach().cpu().numpy()  # 获取预测结果
        # self.predictions.append(predictions)

        # 计算验证损失（负对数似然）
        nll = get_prediction_nll(predictions, fixtures_pred['out'])
        ##self.val_losses.append(nll)
        
        
        generated_logits = self.model.generate(fixtures_gen, 10).detach().cpu().numpy()  # 生成10个单词的预测
        ##generated_logits_test = self.model.generate(fixtures_gen_test, 10).detach().cpu().numpy()
        
        # 生成的结果
        generated = make_generation_text(fixtures_gen, generated_logits, VOCAB)
        ##generated_test = make_generation_text(fixtures_gen_test, generated_logits_test, VOCAB)

        # self.generated.append(generated)
        # self.generated_test.append(generated_test)
        # self.generated_logits.append(generated_logits)
        # self.generated_logits_test.append(generated_logits_test)
        
        # # 生成测试数据的预测结果
        # predictions_test = self.model.predict(fixtures_pred_test['inp']).detach().cpu().numpy()
        # self.predictions_test.append(predictions_test)
        
        # 打印验证信息
        print('[VAL] \tEpoch [%d/%d] \tLoss: %.4f'
              % (self.epochs, self.max_epochs, nll))
        
        return nll, generated

    

    def save(self):
        """
        保存模型的状态字典、预测结果以及生成的结果到指定路径。
        这个方法不需要修改。
        """
        model_path = os.path.join('experiments', self.run_id, 'model-{}.pkl'.format(self.epochs))
        torch.save({'state_dict': self.model.state_dict()}, model_path)
        np.save(os.path.join('experiments', self.run_id, 'predictions-{}.npy'.format(self.epochs)), self.predictions[-1])
        np.save(os.path.join('experiments', self.run_id, 'predictions-test-{}.npy'.format(self.epochs)), self.predictions_test[-1])
        np.save(os.path.join('experiments', self.run_id, 'generated_logits-{}.npy'.format(self.epochs)), self.generated_logits[-1])
        np.save(os.path.join('experiments', self.run_id, 'generated_logits-test-{}.npy'.format(self.epochs)), self.generated_logits_test[-1])
        
        with open(os.path.join('experiments', self.run_id, 'generated-{}.txt'.format(self.epochs)), 'w') as fw:
            fw.write(self.generated[-1])

        with open(os.path.join('experiments', self.run_id, 'generated-{}-test.txt'.format(self.epochs)), 'w') as fw:
            fw.write(self.generated_test[-1])




In [16]:
# TODO: define other hyperparameters here

configs = dict(
    batch_size  = 32,
    num_epochs  = 10, # 10 or 20 epochs should be enough given the model is good
    sequence_length = 10, #定义切分文章时，每几个词构成一句话用来训练
    init_lr     = 0.001 # TODO
)

In [17]:
model       = LanguageModel(vocab_size = len(VOCAB), embed_dim = 128, hid_dim = 128)

model.to(DEVICE)

loader      = DataLoaderForLanguageModeling(dataset, configs['batch_size'], sequence_length=configs['sequence_length']) 

criterion   = torch.nn.CrossEntropyLoss() 

optimizer   = torch.optim.Adam(model.parameters(), lr=configs['init_lr'])
# TODO: Define the optimizer. Adam/AdamW usually works good for this HW

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max = configs['num_epochs'], eta_min=1e-6)

print(model)
inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)
print(loader.__len__())
import torchinfo
torchinfo.summary(model.to(DEVICE), input_data=inputs.to(DEVICE))

LanguageModel(
  (token_embedding): Embedding(33280, 128)
  (lstm_cells): Sequential(
    (0): LSTMCell(128, 128)
    (1): LSTMCell(128, 128)
  )
  (token_probability): Linear(in_features=128, out_features=33280, bias=True)
)
torch.Size([64, 10]) torch.Size([64, 10])
3245


Layer (type:depth-idx)                   Output Shape              Param #
LanguageModel                            [64, 10, 33280]           --
├─Embedding: 1-1                         [64, 10, 128]             4,259,840
├─Sequential: 1-20                       --                        (recursive)
│    └─LSTMCell: 2-1                     [64, 128]                 132,096
│    └─LSTMCell: 2-2                     [64, 128]                 132,096
├─Linear: 1-3                            [64, 33280]               4,293,120
├─Sequential: 1-20                       --                        (recursive)
│    └─LSTMCell: 2-3                     [64, 128]                 (recursive)
│    └─LSTMCell: 2-4                     [64, 128]                 (recursive)
├─Linear: 1-5                            [64, 33280]               (recursive)
├─Sequential: 1-20                       --                        (recursive)
│    └─LSTMCell: 2-5                     [64, 128]                 (recursive

In [18]:


run_id = str(int(time.time()))

if not os.path.exists('./experiments'):
    os.mkdir('./experiments')
os.mkdir('./experiments/%s' % run_id)

print("Saving models, predictions, and generated words to ./experiments/%s" % run_id)

# The object of the Trainer class takes in everything
trainer = Trainer(
    model       = model, 
    loader      = loader, 
    optimizer   = optimizer,
    criterion   = criterion, 
    scheduler   = scheduler,
    max_epochs  = configs['num_epochs'], 
    run_id      = run_id
)

Saving models, predictions, and generated words to ./experiments/1724732305


In [19]:
# Run the experiments loop. 
# Each epoch wont take more than 2-3min. If its taking more time, it might be due to (but not limited to) the following:
#   * You might be overlapping batches 
#       Eg. Input: "I had biryani for lunch today" and sequence length = 3,
#           --> "I had biryani", "for lunch today" are ideal examples for inputs
#           --> "I had biryani", "had biryani for", "biryani for lunch", ... is just redundant info :')
#   * Your length calculation in the dataloader might be wrong
# If you haven't had biryani, try it :D 

# %%time


for epoch in range(configs['num_epochs']):
    
    trainer.train()
    
    nll, generated = trainer.test()
    
    scheduler.step()
    
    print(nll)
    print(generated) #打印输入句子及其该句未来10个词的预测。
    
 

  0%|          | 0/3245 [00:00<?, ?it/s]

[TRAIN] 	Epoch [1/10] 	Loss: 6.5523 	Lr: 0.001000
[VAL] 	Epoch [1/10] 	Loss: 5.3584
5.3584466
Input | Output #0: <sos> while the group was en route , but only three were ultimately able to attack . None of them were | not to the <unk> of the <unk> , and the
Input | Output #1: <sos> <unk> , where he remained on loan until 30 June 2010 . <eol> = = = Return to Manchester United | = = = <eol> The first time of the first
Input | Output #2: <sos> 25 April 2013 , denoting shipments of 500 @,@ 000 copies . <eol> The song became One Direction 's fourth | @-@ year @-@ year @-@ year @-@ year @-@ year
Input | Output #3: <sos> , and Bruce R. ) one daughter ( Wendy J. <unk> ) and two grandchildren , died in <unk> , | and the <unk> of the <unk> , and the <unk>
Input | Output #4: <sos> Warrior were examples of this type . Because their armor was so heavy , they could only carry a single | @-@ year @-@ <unk> @-@ <unk> <unk> , <unk> ,
Input | Output #5: <sos> the embassy at 1 : 49 and landed on Guam at 

  0%|          | 0/3245 [00:00<?, ?it/s]

[TRAIN] 	Epoch [2/10] 	Loss: 5.8059 	Lr: 0.000976
[VAL] 	Epoch [2/10] 	Loss: 5.1629
5.1629086
Input | Output #0: <sos> while the group was en route , but only three were ultimately able to attack . None of them were | also used to be a <unk> <unk> . <eol> =
Input | Output #1: <sos> <unk> , where he remained on loan until 30 June 2010 . <eol> = = = Return to Manchester United | = = <eol> The first time of the <unk> of
Input | Output #2: <sos> 25 April 2013 , denoting shipments of 500 @,@ 000 copies . <eol> The song became One Direction 's fourth | @-@ year @-@ old <unk> . <eol> = = =
Input | Output #3: <sos> , and Bruce R. ) one daughter ( Wendy J. <unk> ) and two grandchildren , died in <unk> , | and <unk> , and <unk> , and <unk> , and
Input | Output #4: <sos> Warrior were examples of this type . Because their armor was so heavy , they could only carry a single | @-@ year @-@ old <unk> . <eol> = = =
Input | Output #5: <sos> the embassy at 1 : 49 and landed on Guam at 2 : 23 ; twenty mi

  0%|          | 0/3245 [00:00<?, ?it/s]

[TRAIN] 	Epoch [3/10] 	Loss: 5.5170 	Lr: 0.000905
[VAL] 	Epoch [3/10] 	Loss: 5.0101
5.010111
Input | Output #0: <sos> while the group was en route , but only three were ultimately able to attack . None of them were | also used as a <unk> , and the <unk> of
Input | Output #1: <sos> <unk> , where he remained on loan until 30 June 2010 . <eol> = = = Return to Manchester United | = = <eol> The first two @-@ thirds of the
Input | Output #2: <sos> 25 April 2013 , denoting shipments of 500 @,@ 000 copies . <eol> The song became One Direction 's fourth | @-@ year @-@ old <unk> , and the <unk> of
Input | Output #3: <sos> , and Bruce R. ) one daughter ( Wendy J. <unk> ) and two grandchildren , died in <unk> , | and the <unk> of the <unk> , and the <unk>
Input | Output #4: <sos> Warrior were examples of this type . Because their armor was so heavy , they could only carry a single | @-@ based on the game . <eol> = = =
Input | Output #5: <sos> the embassy at 1 : 49 and landed on Guam at 2 : 23 ; tw

  0%|          | 0/3245 [00:00<?, ?it/s]

[TRAIN] 	Epoch [4/10] 	Loss: 5.3267 	Lr: 0.000794
[VAL] 	Epoch [4/10] 	Loss: 5.0033
5.0032597
Input | Output #0: <sos> while the group was en route , but only three were ultimately able to attack . None of them were | also used as a <unk> , and the <unk> of
Input | Output #1: <sos> <unk> , where he remained on loan until 30 June 2010 . <eol> = = = Return to Manchester United | = = <eol> The first of the <unk> of the
Input | Output #2: <sos> 25 April 2013 , denoting shipments of 500 @,@ 000 copies . <eol> The song became One Direction 's fourth | @-@ year @-@ old version of the <unk> of the
Input | Output #3: <sos> , and Bruce R. ) one daughter ( Wendy J. <unk> ) and two grandchildren , died in <unk> , | and the <unk> of the <unk> <unk> , <unk> ,
Input | Output #4: <sos> Warrior were examples of this type . Because their armor was so heavy , they could only carry a single | @-@ like @-@ <unk> . <eol> = = = <unk>
Input | Output #5: <sos> the embassy at 1 : 49 and landed on Guam at 2 : 23

  0%|          | 0/3245 [00:00<?, ?it/s]

[TRAIN] 	Epoch [5/10] 	Loss: 5.1899 	Lr: 0.000655
[VAL] 	Epoch [5/10] 	Loss: 4.9721
4.9720583
Input | Output #0: <sos> while the group was en route , but only three were ultimately able to attack . None of them were | also used to be the <unk> of the <unk> .
Input | Output #1: <sos> <unk> , where he remained on loan until 30 June 2010 . <eol> = = = Return to Manchester United | = = <eol> The first of the season , the
Input | Output #2: <sos> 25 April 2013 , denoting shipments of 500 @,@ 000 copies . <eol> The song became One Direction 's fourth | @-@ year @-@ old @-@ time with a <unk> <unk>
Input | Output #3: <sos> , and Bruce R. ) one daughter ( Wendy J. <unk> ) and two grandchildren , died in <unk> , | and the <unk> of the <unk> <unk> , and the
Input | Output #4: <sos> Warrior were examples of this type . Because their armor was so heavy , they could only carry a single | @-@ day @-@ up . <eol> = = = <unk>
Input | Output #5: <sos> the embassy at 1 : 49 and landed on Guam at 2 : 23 ; 

  0%|          | 0/3245 [00:00<?, ?it/s]

[TRAIN] 	Epoch [6/10] 	Loss: 5.0851 	Lr: 0.000501
[VAL] 	Epoch [6/10] 	Loss: 4.8943
4.8942766
Input | Output #0: <sos> while the group was en route , but only three were ultimately able to attack . None of them were | also used in the <unk> <unk> . <eol> = =
Input | Output #1: <sos> <unk> , where he remained on loan until 30 June 2010 . <eol> = = = Return to Manchester United | = = = <eol> The first of the first time
Input | Output #2: <sos> 25 April 2013 , denoting shipments of 500 @,@ 000 copies . <eol> The song became One Direction 's fourth | @-@ year @-@ old @-@ <unk> , and the <unk>
Input | Output #3: <sos> , and Bruce R. ) one daughter ( Wendy J. <unk> ) and two grandchildren , died in <unk> , | and the <unk> of the <unk> <unk> , and the
Input | Output #4: <sos> Warrior were examples of this type . Because their armor was so heavy , they could only carry a single | @-@ year @-@ old . <eol> = = = =
Input | Output #5: <sos> the embassy at 1 : 49 and landed on Guam at 2 : 23 ; twen

  0%|          | 0/3245 [00:00<?, ?it/s]

[TRAIN] 	Epoch [7/10] 	Loss: 5.0053 	Lr: 0.000346
[VAL] 	Epoch [7/10] 	Loss: 4.8893
4.889266
Input | Output #0: <sos> while the group was en route , but only three were ultimately able to attack . None of them were | also used to be the first to be a <unk>
Input | Output #1: <sos> <unk> , where he remained on loan until 30 June 2010 . <eol> = = = Return to Manchester United | = = <eol> The first of the season , the
Input | Output #2: <sos> 25 April 2013 , denoting shipments of 500 @,@ 000 copies . <eol> The song became One Direction 's fourth | @-@ year @-@ old @-@ time " . <eol> =
Input | Output #3: <sos> , and Bruce R. ) one daughter ( Wendy J. <unk> ) and two grandchildren , died in <unk> , | and the <unk> of the <unk> <unk> , and the
Input | Output #4: <sos> Warrior were examples of this type . Because their armor was so heavy , they could only carry a single | @-@ based on the <unk> of the <unk> . <eol>
Input | Output #5: <sos> the embassy at 1 : 49 and landed on Guam at 2 : 23 ; 

  0%|          | 0/3245 [00:00<?, ?it/s]

[TRAIN] 	Epoch [8/10] 	Loss: 4.9447 	Lr: 0.000207
[VAL] 	Epoch [8/10] 	Loss: 4.8991
4.8990984
Input | Output #0: <sos> while the group was en route , but only three were ultimately able to attack . None of them were | also used to be a <unk> , and the <unk>
Input | Output #1: <sos> <unk> , where he remained on loan until 30 June 2010 . <eol> = = = Return to Manchester United | = = = <eol> The first of the season ,
Input | Output #2: <sos> 25 April 2013 , denoting shipments of 500 @,@ 000 copies . <eol> The song became One Direction 's fourth | @-@ year @-@ old @-@ game against the <unk> .
Input | Output #3: <sos> , and Bruce R. ) one daughter ( Wendy J. <unk> ) and two grandchildren , died in <unk> , | and the <unk> of the <unk> . <eol> = =
Input | Output #4: <sos> Warrior were examples of this type . Because their armor was so heavy , they could only carry a single | @-@ year contract with the <unk> of the <unk> .
Input | Output #5: <sos> the embassy at 1 : 49 and landed on Guam at 2 

  0%|          | 0/3245 [00:00<?, ?it/s]

[TRAIN] 	Epoch [9/10] 	Loss: 4.9041 	Lr: 0.000096
[VAL] 	Epoch [9/10] 	Loss: 4.8921
4.892123
Input | Output #0: <sos> while the group was en route , but only three were ultimately able to attack . None of them were | not only to be the first to be a <unk>
Input | Output #1: <sos> <unk> , where he remained on loan until 30 June 2010 . <eol> = = = Return to Manchester United | = = <eol> The first of the season , the
Input | Output #2: <sos> 25 April 2013 , denoting shipments of 500 @,@ 000 copies . <eol> The song became One Direction 's fourth | @-@ year @-@ old @-@ time with a <unk> <unk>
Input | Output #3: <sos> , and Bruce R. ) one daughter ( Wendy J. <unk> ) and two grandchildren , died in <unk> , | and the <unk> of the <unk> <unk> , and <unk>
Input | Output #4: <sos> Warrior were examples of this type . Because their armor was so heavy , they could only carry a single | @-@ day Category 1 hurricane . The first time in
Input | Output #5: <sos> the embassy at 1 : 49 and landed on Guam

  0%|          | 0/3245 [00:00<?, ?it/s]

[TRAIN] 	Epoch [10/10] 	Loss: 4.8811 	Lr: 0.000025
[VAL] 	Epoch [10/10] 	Loss: 4.8904
4.8903923
Input | Output #0: <sos> while the group was en route , but only three were ultimately able to attack . None of them were | also used to be the first to be a <unk>
Input | Output #1: <sos> <unk> , where he remained on loan until 30 June 2010 . <eol> = = = Return to Manchester United | = = <eol> The first of the season , the
Input | Output #2: <sos> 25 April 2013 , denoting shipments of 500 @,@ 000 copies . <eol> The song became One Direction 's fourth | @-@ year @-@ old @-@ time with a <unk> <unk>
Input | Output #3: <sos> , and Bruce R. ) one daughter ( Wendy J. <unk> ) and two grandchildren , died in <unk> , | and the <unk> of the <unk> <unk> , and the
Input | Output #4: <sos> Warrior were examples of this type . Because their armor was so heavy , they could only carry a single | @-@ day Category 1 hurricane . The first time in
Input | Output #5: <sos> the embassy at 1 : 49 and landed on Gu