# Task03. Two-Stage Processing for Downstream Tasks

# 1. What Is Two-Stage Processing

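### Collected time-series data often has "flaws", such as missing values, anomalous data points, or anomalous samples. These flaws are themselves properties of the data (for example, the pattern of missing values may reflect abnormal device behavior), but in typical modeling they are harmful and unwanted: they hinder learning and can even prevent us from modeling the data end-to-end with a neural network (for example, many time-series forecasting and classification algorithms cannot handle missing values). In such cases we first preprocess the series upstream for a specific purpose, and then build a model for the downstream task (forecasting, classification, and so on). The modeling process is thus split into two stages.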
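
To make the two stages concrete, here is a minimal sketch under simplifying assumptions: stage one fills missing values with per-feature means (a stand-in for a learned imputation model), and stage two is any downstream model that requires complete input. The function name `mean_impute` and the toy array are hypothetical and exist only for illustration.

```python
import numpy as np

def mean_impute(X):
    """Stage 1: replace NaNs with the mean of each feature.

    X has shape (n_samples, n_steps, n_features). Means are computed
    over all samples and time steps, ignoring missing entries.
    """
    feature_means = np.nanmean(X, axis=(0, 1))  # shape: (n_features,)
    return np.where(np.isnan(X), feature_means, X)

# Toy data: 2 samples, 3 time steps, 2 features, with missing values
X = np.array([
    [[1.0, np.nan], [2.0, 4.0], [np.nan, 6.0]],
    [[3.0, 8.0], [np.nan, np.nan], [6.0, 2.0]],
])

X_imputed = mean_impute(X)

# Stage 2 would train a downstream model (e.g. a classifier) on X_imputed,
# which now contains no missing values.
assert not np.isnan(X_imputed).any()
```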

# 2. Classifying the Imputed Dataset Saved in Task02 with an LSTM

### 2.1 Writing the model and training function

In [1]:
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
from copy import deepcopy

# Run the model on the CPU; set this to 'cuda' if you have a GPU
DEVICE='cpu'

class LoadImputedDataAndLabel(Dataset):
    def __init__(self, imputed_data, labels):
        self.imputed_data = imputed_data
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return (
            torch.from_numpy(self.imputed_data[idx]).to(torch.float32),
            torch.tensor(self.labels[idx]).to(torch.long),
        )


class ClassificationLSTM(torch.nn.Module):
    def __init__(self, n_features, rnn_hidden_size, n_classes):
        super().__init__()
        self.rnn = torch.nn.LSTM(
            n_features,
            hidden_size=rnn_hidden_size,
            batch_first=True,
        )
        self.fcn = torch.nn.Linear(rnn_hidden_size, n_classes)

    def forward(self, data):
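        hidden_states, _ = self.rnn(data)
        # Classify from the hidden state at the last time step
        logits = self.fcn(hidden_states[:, -1, :])
        # Squash the logits into (0, 1); note that F.cross_entropy in train()
        # applies log-softmax on top of these values
        prediction_probabilities = torch.sigmoid(logits)
        return prediction_probabilities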


def train(model, train_dataloader, val_dataloader, test_loader):
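    n_epochs = 20
    patience = 5  # epochs to wait for a validation-loss improvement before stopping
    model = model.to(DEVICE)  # keep the model on the same device as the batches
    optimizer = torch.optim.Adam(model.parameters(), 1e-3)
    current_patience = patience
    best_loss = float("inf")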
    for epoch in range(n_epochs):
        model.train()
        for idx, data in enumerate(train_dataloader):
            X, y = map(lambda x: x.to(DEVICE), data)
            optimizer.zero_grad()
            probabilities = model(X)
            loss = F.cross_entropy(probabilities, y.reshape(-1))
            loss.backward()
            optimizer.step()

        model.eval()
        loss_collector = []
        with torch.no_grad():
            for idx, data in enumerate(val_dataloader):
                X, y = map(lambda x: x.to(DEVICE), data)
                probabilities = model(X)
                loss = F.cross_entropy(probabilities, y.reshape(-1))
                loss_collector.append(loss.item())

        loss = np.asarray(loss_collector).mean()
        if best_loss > loss:
            current_patience = patience
            best_loss = loss
            best_model = deepcopy(model.state_dict())
        else:
            current_patience -= 1

        if current_patience == 0:
            break

    # Restore the weights that achieved the lowest validation loss
    model.load_state_dict(best_model)
    model.eval()

    probability_collector = []
    with torch.no_grad():  # no gradients are needed for inference
        for idx, data in enumerate(test_loader):
            X, y = map(lambda x: x.to(DEVICE), data)
            probabilities = model(X)
            probability_collector += probabilities.cpu().tolist()

    probability_collector = np.asarray(probability_collector)
    return probability_collector


def get_dataloaders(train_X, train_y, val_X, val_y, test_X, test_y, batch_size=128):
    train_set = LoadImputedDataAndLabel(train_X, train_y)
    val_set = LoadImputedDataAndLabel(val_X, val_y)
    test_set = LoadImputedDataAndLabel(test_X, test_y)
    train_loader = DataLoader(train_set, batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size, shuffle=False)
    test_loader = DataLoader(test_set, batch_size, shuffle=False)
    return train_loader, val_loader, test_loader
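
One detail in the code above is worth a closer look: `forward` returns sigmoid outputs, while `F.cross_entropy` applies log-softmax to its input and is normally fed raw logits. The sketch below re-implements two-class cross-entropy in plain Python (the helper names are hypothetical) to show the effect. Because sigmoid outputs lie in (0, 1), the two class scores can differ by less than 1, so the loss can never drop below log(1 + e^-1) ≈ 0.313, however confident the model is. This only illustrates the arithmetic; it does not change the tutorial's training code.

```python
import math

def cross_entropy(scores, target):
    """Softmax cross-entropy for one sample, given per-class scores."""
    max_score = max(scores)  # subtract the max for numerical stability
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum_exp - scores[target]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logits = [-5.0, 5.0]  # a confident, correct prediction for class 1
target = 1

loss_from_logits = cross_entropy(logits, target)
loss_from_sigmoid = cross_entropy([sigmoid(z) for z in logits], target)
lower_bound = math.log(1.0 + math.exp(-1.0))

print(f"loss on raw logits:      {loss_from_logits:.4f}")
print(f"loss on sigmoid outputs: {loss_from_sigmoid:.4f}")
print(f"floor for sigmoid input: {lower_bound:.4f}")
```

A common alternative is to return the logits from `forward` and apply `torch.softmax` only when probabilities are needed for evaluation. Doing so would change the scores reported in section 2.4.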


### 2.2 Loading the data

In [2]:
from pypots.data.saving import pickle_load

# Reload the imputed data
imputed_physionet2012 = pickle_load('result_saving/imputed_physionet2012.pkl')
train_X = imputed_physionet2012['train_set_imputation']
val_X = imputed_physionet2012['val_set_imputation']
test_X = imputed_physionet2012['test_set_imputation']
# Only the labels of the corresponding samples from the original dataset are needed here
train_y = imputed_physionet2012['train_set_labels']
val_y = imputed_physionet2012['val_set_labels']
test_y = imputed_physionet2012['test_set_labels']

# Convert to torch DataLoaders
train_loader, val_loader, test_loader = get_dataloaders(
    train_X, 
    train_y, 
    val_X, 
    val_y, 
    test_X, 
    test_y,
)
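
Before training, it can help to confirm that the loaded arrays are what the classifier expects. The check below is a hypothetical helper (not part of PyPOTS), run here on random stand-in data. It assumes the imputed arrays are 3-D with shape (n_samples, n_steps, n_features); the 48 time steps in the example are an assumption made for illustration.

```python
import numpy as np

def check_imputed_split(X, y, n_features):
    """Validate one data split before building a DataLoader."""
    assert X.ndim == 3, "expected (n_samples, n_steps, n_features)"
    assert X.shape[-1] == n_features, "unexpected number of features"
    assert len(X) == len(y), "X and y must have the same number of samples"
    assert not np.isnan(X).any(), "imputed data should contain no missing values"
    return True

# Random stand-in data shaped like a small imputed split
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(8, 48, 37))
demo_y = rng.integers(0, 2, size=8)

split_ok = check_imputed_split(demo_X, demo_y, n_features=37)
```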

### 2.3 Training the classifier

In [4]:
# Initialize the LSTM classifier
rnn_classifier = ClassificationLSTM(
    n_features=37,
    rnn_hidden_size=128,
    n_classes=2,  # physionet2012 is a binary classification dataset
)
# Train the LSTM classifier
proba_predictions = train(rnn_classifier, train_loader, val_loader, test_loader)
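
The `train` function stops early when the validation loss has not improved for `patience` consecutive epochs, and then restores the best weights. The same bookkeeping, isolated from the model, can be sketched as follows; the function name and the list of losses are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Mirrors the counter logic in train(): the counter resets on every
    improvement and training stops once it reaches zero.
    """
    best_loss = float("inf")
    best_epoch = None
    current_patience = patience
    stop_epoch = None
    for epoch, loss in enumerate(val_losses):
        if best_loss > loss:
            current_patience = patience
            best_loss = loss
            best_epoch = epoch
        else:
            current_patience -= 1
        stop_epoch = epoch
        if current_patience == 0:
            break
    return best_epoch, stop_epoch

# The loss improves until epoch 2, then stalls for three epochs
losses = [0.70, 0.65, 0.60, 0.61, 0.62, 0.63, 0.50]
best_epoch, stop_epoch = early_stopping_epoch(losses, patience=3)
```

With `patience=3`, training halts at epoch 5 and the weights from epoch 2 are the ones restored. The later improvement at epoch 6 is never reached, which is the trade-off of a small `patience`.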

### 2.4 Computing classification metrics

In [5]:
from pypots.nn.functional.classification import calc_binary_classification_metrics

pos_num = test_y.sum()
neg_num = len(test_y) - test_y.sum()
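print(f'The positive-to-negative ratio in test_set is {pos_num}:{neg_num}, and positives make up {pos_num/len(test_y)} of the samples, so this is an imbalanced binary classification problem and we use ROC-AUC and PR-AUC as evaluation metrics\n')

classification_metrics = calc_binary_classification_metrics(
    proba_predictions, test_y
)
print(f"ROC-AUC of the LSTM on the test set: {classification_metrics['roc_auc']:.4f}\n")
print(f"PR-AUC of the LSTM on the test set: {classification_metrics['pr_auc']:.4f}\n")

The positive-to-negative ratio in test_set is 111:689, and positives make up 0.13875 of the samples, so this is an imbalanced binary classification problem and we use ROC-AUC and PR-AUC as evaluation metrics

ROC-AUC of the LSTM on the test set: 0.6626

PR-AUC of the LSTM on the test set: 0.3177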
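
For intuition about these metrics, ROC-AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, so 0.5 corresponds to random ranking. PR-AUC has no fixed baseline: for a random ranking it is roughly the positive rate, about 0.139 on this test set. The sketch below computes ROC-AUC directly from the pairwise definition. It is a hypothetical helper for illustration, not the implementation used by `calc_binary_classification_metrics`.

```python
def pairwise_roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(n_pos * n_neg), which is
    fine for illustration but slow for large datasets.
    """
    pos_scores = [s for s, y in zip(scores, labels) if y == 1]
    neg_scores = [s for s, y in zip(scores, labels) if y == 0]
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_roc_auc(labels, scores)  # 3 of 4 pairs are ordered correctly

positive_rate = 111 / (111 + 689)  # class balance reported for the test set
```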



# 3. Reading Material

### Ralf C. Staudemeyer, Eric Rothstein Morris. (2019). [Understanding LSTM -- a tutorial into Long Short-Term Memory Recurrent Neural Networks](https://arxiv.org/abs/1909.09586). arXiv.
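#### Why we recommend it: LSTM is a long-established time-series algorithm that is still discussed in many papers. This paper explains the LSTM model in a tutorial style and is friendly to beginners, especially graduate students who are new to the time-series field. As of May 2025 it has 1300+ citations on Google Scholar.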