# Task06. 自定义数据集下游任务的两种分析方法

在本节中，我们将基于 **Task05中自定义的数据集** 进行下游任务分析，主要包括：

1. 使用LSTM基于插补后的数据集进行分类
2. 基于PyPOTS中TimesNet模型对带缺失值的原数据进行端到端学习的分类

## 1. 基于插补数据的基线分类分析

### 1.1 加载插补后的数据集

首先，从之前保存的插补结果文件中加载训练集、验证集和测试集：

In [1]:
from pypots.data.saving import pickle_load

imputed_dataset = pickle_load('result_saving/imputed_synthetic_eicu.pkl')

train_X, val_X, test_X = imputed_dataset['train_X'], imputed_dataset['val_X'], imputed_dataset['test_X']
train_y, val_y, test_y = imputed_dataset['train_y'], imputed_dataset['val_y'], imputed_dataset['test_y']

  from .autonotebook import tqdm as notebook_tqdm


[34m
████████╗██╗███╗   ███╗███████╗    ███████╗███████╗██████╗ ██╗███████╗███████╗    █████╗ ██╗
╚══██╔══╝██║████╗ ████║██╔════╝    ██╔════╝██╔════╝██╔══██╗██║██╔════╝██╔════╝   ██╔══██╗██║
   ██║   ██║██╔████╔██║█████╗█████╗███████╗█████╗  ██████╔╝██║█████╗  ███████╗   ███████║██║
   ██║   ██║██║╚██╔╝██║██╔══╝╚════╝╚════██║██╔══╝  ██╔══██╗██║██╔══╝  ╚════██║   ██╔══██║██║
   ██║   ██║██║ ╚═╝ ██║███████╗    ███████║███████╗██║  ██║██║███████╗███████║██╗██║  ██║██║
   ╚═╝   ╚═╝╚═╝     ╚═╝╚══════╝    ╚══════╝╚══════╝╚═╝  ╚═╝╚═╝╚══════╝╚══════╝╚═╝╚═╝  ╚═╝╚═╝
ai4ts v0.0.2 - building AI for unified time-series analysis, https://time-series.ai [0m



### 1.2 基于LSTM进行分类

In [2]:
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
from copy import deepcopy

# 设置模型的运行设备为cpu, 如果你有gpu设备可以设置为cuda
DEVICE='cpu'

class LoadImputedDataAndLabel(Dataset):
    def __init__(self, imputed_data, labels):
        self.imputed_data = imputed_data
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return (
            torch.from_numpy(self.imputed_data[idx]).to(torch.float32),
            torch.tensor(self.labels[idx]).to(torch.long),
        )


class ClassificationLSTM(torch.nn.Module):
    def __init__(self, n_features, rnn_hidden_size, n_classes):
        super().__init__()
        self.rnn = torch.nn.LSTM(
            n_features,
            hidden_size=rnn_hidden_size,
            batch_first=True,
        )
        self.fcn = torch.nn.Linear(rnn_hidden_size, n_classes)

    def forward(self, data):
        hidden_states, _ = self.rnn(data)
        logits = self.fcn(hidden_states[:, -1, :])
        prediction_probabilities = torch.sigmoid(logits)
        return prediction_probabilities


def train(model, train_dataloader, val_dataloader, test_loader):
    n_epochs = 20
    patience = 5
    optimizer = torch.optim.Adam(model.parameters(), 1e-3)
    current_patience = patience
    best_loss = float("inf")
    for epoch in range(n_epochs):
        model.train()
        for idx, data in enumerate(train_dataloader):
            X, y = map(lambda x: x.to(DEVICE), data)
            optimizer.zero_grad()
            probabilities = model(X)
            loss = F.cross_entropy(probabilities, y.reshape(-1))
            loss.backward()
            optimizer.step()

        model.eval()
        loss_collector = []
        with torch.no_grad():
            for idx, data in enumerate(val_dataloader):
                X, y = map(lambda x: x.to(DEVICE), data)
                probabilities = model(X)
                loss = F.cross_entropy(probabilities, y.reshape(-1))
                loss_collector.append(loss.item())

        loss = np.asarray(loss_collector).mean()
        if best_loss > loss:
            current_patience = patience
            best_loss = loss
            best_model = deepcopy(model.state_dict())
        else:
            current_patience -= 1

        if current_patience == 0:
            break

    model.load_state_dict(best_model)
    model.eval()

    probability_collector = []
    for idx, data in enumerate(test_loader):
        X, y = map(lambda x: x.to(DEVICE), data)
        probabilities = model.forward(X)
        probability_collector += probabilities.cpu().tolist()

    probability_collector = np.asarray(probability_collector)
    return probability_collector


def get_dataloaders(train_X, train_y, val_X, val_y, test_X, test_y, batch_size=128):
    train_set = LoadImputedDataAndLabel(train_X, train_y)
    val_set = LoadImputedDataAndLabel(val_X, val_y)
    test_set = LoadImputedDataAndLabel(test_X, test_y)
    train_loader = DataLoader(train_set, batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size, shuffle=False)
    test_loader = DataLoader(test_set, batch_size, shuffle=False)
    return train_loader, val_loader, test_loader


In [3]:
# 转换成torch dataloader
train_loader, val_loader, test_loader = get_dataloaders(
    train_X, 
    train_y, 
    val_X, 
    val_y, 
    test_X, 
    test_y,
)

rnn_classifier = ClassificationLSTM(
    n_features=imputed_dataset['n_features'],
    rnn_hidden_size=128,
    n_classes=2, # physionet2012是一个二分类数据集
)
proba_predictions = train(rnn_classifier, train_loader, val_loader, test_loader)

In [4]:
from pypots.nn.functional.classification import calc_binary_classification_metrics

pos_num = test_y.sum()
neg_num = len(test_y) - test_y.sum()
print(f'test_set中的正负样本比例为{pos_num}:{neg_num}, 正样本占样本数量的{pos_num/len(test_y)}, 所以这是一个不平衡的二分类问题, 故我们在此使用ROC-AUC和PR-AUC作为评价指标\n')

classification_metrics=calc_binary_classification_metrics(
    proba_predictions, test_y
)
print(f"LSTM在测试集上的ROC-AUC为: {classification_metrics['roc_auc']:.4f}\n")
print(f"LSTM在测试集上的PR-AUC为: {classification_metrics['pr_auc']:.4f}\n")

test_set中的正负样本比例为62:552, 正样本占样本数量的0.10097719869706841, 所以这是一个不平衡的二分类问题, 故我们在此使用ROC-AUC和PR-AUC作为评价指标

LSTM在测试集上的ROC-AUC为: 0.6264

LSTM在测试集上的PR-AUC为: 0.2690



## 2. 使用PyPOTS中的TimesNet模型对自定义数据集进行基于端到端的时序建模与分类分析

### 2.1 数据集准备

In [5]:
import pypots
import numpy as np
import pandas as pd
import tsdb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 设置模型的运行设备为cpu, 如果你有gpu设备可以设置为cuda
DEVICE='cpu'

from pypots.data.saving import pickle_load
processed_dataset = pickle_load('result_saving/processed_synthetic_eicu.pkl')

# 组装训练集
dataset_for_training = {
    "X": processed_dataset['train_X'],
    'y': processed_dataset['train_y'],
}

# 组装验证集
dataset_for_validating = {
    "X": processed_dataset['val_X'],
    "y": processed_dataset['val_y'],
}

# 组装测试集
dataset_for_testing = {
    "X": processed_dataset['test_X'],
    "y": processed_dataset['test_y'],
  }

### 2.2 TimesNet建模分析

In [6]:
from pypots.nn.functional import calc_mae
from pypots.optim import Adam
from pypots.classification import TimesNet

# 创建 TimesNet 模型
# TimesNet 模型的参数可以根据需要进行调整
timesnet = TimesNet(
    n_steps=processed_dataset['n_steps'],
    n_features=processed_dataset['n_features'],
    n_classes=processed_dataset['n_classes'],
    n_layers=2,
    top_k=3,
    d_model=64,
    d_ffn=128,
    n_kernels=3,
    dropout=0.3,
    batch_size=32,
    # 这里为了快速演示我们将epochs设置为20，你可以将其设置为100或更多以获得更好的结果
    epochs=20,
    # 这里我们设置patience=5，如果连续5个epoch的评估loss没有减少，则提前停止训练。你可以不设置它,则默认为None,禁用早停机制
    patience=5,
    # 设置优化器。不同于torch.optim。在初始化pypots.optimizer时，你不必指定模型的参数。您也可以不设置它, 它将默认初始化一个lr=0.001的Adam优化器。
    optimizer=Adam(lr=1e-3),
    # 这个num_workers参数用于torch.utils.data.Dataloader。它是用于数据加载的子进程的数量。让它默认为0意味着数据加载将在主进程中，即不会有子进程。如果你认为数据加载是模型训练速度的瓶颈，则可以将其增加
    num_workers=0,
    # 如果不设置device, PyPOTS将自动为你分配最佳设备。这里我们将其设置为“cpu”。你也可以设置为'cuda', ‘cuda:0’或‘cuda:1’，如果你有多个cuda设备，甚至并行['cuda:0', 'cuda:1']
    device=DEVICE,
    # 设置保存tensorboard和训练模型文件的路径
    saving_path="result_saving/classification/timesnet",
    # 训练完成后只保存最好的模型。你还可以将其设置为“better”，以保存在训练期间每一次在val set上表现得比之前更好的模型
    model_saving_strategy="best",
)

# 训练阶段，使用训练集和验证集
timesnet.fit(train_set=dataset_for_training, val_set=dataset_for_validating)


2025-05-11 00:02:18 [INFO]: Using the given device: cpu
2025-05-11 00:02:18 [INFO]: Model files will be saved to result_saving/classification/timesnet/20250511_T000218
2025-05-11 00:02:18 [INFO]: Tensorboard file will be saved to result_saving/classification/timesnet/20250511_T000218/tensorboard
2025-05-11 00:02:18 [INFO]: Using customized CrossEntropy as the training loss function.
2025-05-11 00:02:18 [INFO]: Using customized CrossEntropy as the validation metric function.
2025-05-11 00:02:18 [INFO]: TimesNet initialized with the given hyperparameters, the number of trainable parameters: 1,158,274
2025-05-11 00:03:02 [INFO]: Epoch 001 - training loss (CrossEntropy): 0.3349, validation CrossEntropy: 0.2572
2025-05-11 00:03:47 [INFO]: Epoch 002 - training loss (CrossEntropy): 0.2741, validation CrossEntropy: 0.2534
2025-05-11 00:04:40 [INFO]: Epoch 003 - training loss (CrossEntropy): 0.2626, validation CrossEntropy: 0.2721
2025-05-11 00:05:26 [INFO]: Epoch 004 - training loss (CrossEntr

In [7]:
from pypots.nn.functional.classification import calc_binary_classification_metrics

timesnet_results = timesnet.predict(dataset_for_testing)
timesnet_prediction = timesnet_results["classification"]

classification_metrics=calc_binary_classification_metrics(
    timesnet_prediction, dataset_for_testing["y"]
)
print(f"TimesNet在测试集上的ROC-AUC为: {classification_metrics['roc_auc']:.4f}\n")
print(f"TimesNet在测试集上的PR-AUC为: {classification_metrics['pr_auc']:.4f}\n")

TimesNet在测试集上的ROC-AUC为: 0.5743

TimesNet在测试集上的PR-AUC为: 0.4171



# 3. 阅读材料

### Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., & Long, M. (2023). [TimesNet: Temporal 2d-variation modeling for general time series analysis](https://openreview.net/forum?id=ju_Uqw384Oq). *ICLR 2023*

#### 推荐原因: 该文提出了适用于多个时序分析任务的模型结构TimesNet. 文章被人工智能顶级会议ICLR 2023收录. 截止2025年5月Google Scholar上引用1300+.